2017 Edition

Anomaly Detection in Network Traffic with K-means clustering

We can categorize machine learning algorithms into two main groups: supervised learning and unsupervised learning. With supervised learning algorithms, in order to predict unknown values for new data, we have to know the target value for many previously-seen examples. In contrast, unsupervised learning algorithms explore the data which has no target attribute to find some intrinsic structures in them.

Clustering is a technique for finding similar groups in data, called clusters. Clustering is often called an unsupervised learning task as no class values denoting an a priori grouping of the data instances are given.

In this notebook, we will use K-means, a very well-known clustering algorithm to detect anomaly network connections based on statistics about each of them. A thorough overview of K-means clustering, from a research perspective, can be found in the following wonderful tutorial.

Goals

We expect students to:

  • Learn (or revise) and understand the K-means algorithm
  • Implement a simple K-means algorithm
  • Use K-means to detect anomalies network connection data

Steps

  1. In section 1, we will have an overview about K-means then implement a simple version of it.
  2. In section 2, we build models with and without categorical features.
  3. Finally, in the last section, using our models, we will detect unusual connections.

1. K-means

1.1. Introduction

Clustering is a typical and well-known type of unsupervised learning. Clustering algorithms try to find natural groupings in data. Similar data points (according to some notion of similarity) are considered in the same group. We call these groups clusters.

K-Means clustering is a simple and widely-used clustering algorithm. Given value of $k$, it tries to build $k$ clusters from samples in the dataset. Therefore, $k$ is an hyperparameter of the model. The right value of $k$ is not easy to determine, as it highly depends on the data set and the way that data is featurized.

To measure the similarity between any two data points, K-means requires the definition of a distance function between data points. What is a distance? It is a value that indicates how close two data points are in their space. In particular, when data points lie in a $d$-dimensional space, the Euclidean distance is a good choice of a distance function, and is supported by MLLIB.

In K-means, a cluster is a group of points, with a representative entity called a centroid. A centroid is also a point in the data space: the center of all the points that make up the cluster. It's defined to be the arithmetic mean of the points. In general, when working with K-means, each data sample is represented in a $d$-dimensional numeric vector, for which it is easier to define an appropriate distance function. As a consequence, in some applications, the original data must be transformed into a different representation, to fit the requirements of K-means.

1.2. How does it work?

Given $k$, the K-means algorithm works as follows:

  1. Randomly choose $k$ data points (seeds) to be the initial centroids
  2. Assign each data point to the closest centroid
  3. Re-compute (update) the centroids using the current cluster memberships
  4. If a convergence criterion is not met, go to step 2

We can also terminate the algorithm when it reaches an iteration budget, which yields an approximate result. From the pseudo-code of the algorithm, we can see that K-means clustering results can be sensitive to the order in which data samples in the data set are explored. A sensible practice would be to run the analysis several times, randomizing objects order; then, average the cluster centers of those runs and input the centers as initial ones for one final run of the analysis.

1.3. Illustrative example

One of the best ways to study an algorithm is trying implement it. In this section, we will go step by step to implement a simple K-means algorithm.

Question 1

Question 1.1

Complete the below function to calculate an Euclidean distance between any two points in $d$-dimensional data space
In [1]:
import numpy as np

# calculate distance between two d-dimensional points
def euclidean_distance(p1, p2):
    return np.linalg.norm(np.array(p1)-np.array(p2))

# test our function
assert (round(euclidean_distance([1,2,3] , [10,18,12]), 2) == 20.45), "Function's wrong"

Instead of using the normal calculation like this : $$np.sqrt(np.sum((np.array(p1)-np.array(p2))**2))$$We take advantage of the norm() function in numpy library

Question 1.2

Given a data point and the current set of centroids, complete the function below to find the index of the closest centroid for that data point.
In [2]:
def find_closest_centroid(datapoint, centroids):
    # find the index of the closest centroid of the given data point.
    Dist_to_centroid = np.array(list(map(lambda x: euclidean_distance(x , datapoint), centroids)))
    print (Dist_to_centroid)
    cen_idx = np.where(Dist_to_centroid==Dist_to_centroid.min())[0]
    return cen_idx

assert(find_closest_centroid( [1,1,1], [ [2,1,2], [1,2,1], [3,1,2] ] ) == 1), "Function's wrong"
[ 1.41421356  1.          2.23606798]

Dist_each_point is a list of distance of a point to each centroid

Then return the order of the centroid which has the smallest distance

Question 1.3

Write a function to randomize `k` initial centroids.
In [3]:
np.random.seed(22324)

# randomize initial centroids
def randomize_centroids(data, k):
    np.random.seed()
    np.random.shuffle(data)
    return data[:k]

assert(len(
    randomize_centroids(
        np.array([ 
            np.array([2,1,2]), 
            np.array([1,2,1]), 
            np.array([3,1,2]) 
             ]), 
        2)) == 2), "Wrong function"

Question 1.4

Write function `check_converge` to check the stop criteria of the algorithm.
In [4]:
MAX_ITERATIONS = 100

# return True if clusters have converged , otherwise, return False  
def check_converge(centroids, old_centroids, num_iterations, threshold=0):
    # if it reaches an iteration budget
    if num_iterations >= MAX_ITERATIONS:
        return True
    # check if the centroids don't move (or very slightly)
    delta_dist_list = np.array(list(map(lambda i: euclidean_distance(centroids[i] , old_centroids[i]), range(0,len(centroids)))))
    sumDist = delta_dist_list.sum()
    if sumDist <= threshold:
        return True
    return False

delta_dist_list is a list of how much each centroid changes its distance

Question 1.5

Write function `update_centroid` to update the new positions for the current centroids based on the position of their members.
In [5]:
# centroids: a list of centers
# cluster: a list of k elements. Each element i-th is a list of data points that are assigned to center i-th
def update_centroids(centroids, cluster):
    centroids = np.array(list(map(lambda x: np.mean(x, axis=0), cluster)))
    return centroids

Question 1.6

Complete the K-means algorithm skeleton below, with the functions you wrote above.
In [6]:
# data : set of data points
# k : number of clusters
# centroids: initial list of centroids
def kmeans(data, k=2, centroids=None):
    
    # randomize the centroids if they are not given
    if not centroids:
        centroids = randomize_centroids(data, k)

    #old_centroids = centroids[:]

    iterations = 0
    while True:
        iterations += 1

        # init empty clusters
        clusters = [[] for i in range(k)]

        # assign each data point to the closest centroid
        for datapoint in data:
            # find the closest center of each data point
            centroid_idx = find_closest_centroid(datapoint, centroids)
            
            # assign datapoint to the closest cluster
            clusters[centroid_idx].append(np.array(datapoint))
        
        # keep the current position of centroids before changing them
        old_centroids = centroids[:]
        
        # update centroids
        centroids = update_centroids(centroids, clusters)
        
        # if the stop criteria are met, stop the algorithm
        if check_converge(centroids, old_centroids, iterations, threshold=0) == True:
            break
    
    return centroids

Next, we will test our algorithm on Fisher's Iris dataset, and plot the resulting clusters in 3D.

Question 1.7

The code below can be used to test your algorithm with three different datasets: `Iris`, `Moon` and `Blob`. Run your algorithm to cluster datapoints in these datasets, plot the results and discuss about them. Do you think that our algorithm works well? Why?
In [7]:
# the sourcecode in this cell is inspired from 
# https://gist.github.com/bbarrilleaux/9841297

%matplotlib inline

from sklearn import datasets, cluster
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
from sklearn.decomposition import PCA
# load data
iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target
# do the clustering
centers = kmeans(X_iris, k=3)
labels = [find_closest_centroid(p, centers) for p in X_iris]

#plot the clusters in color
#fig = plt.figure(1, figsize=(8, 8))
#plt.clf()
#ax = Axes3D(fig, rect=[0, 0, 1, 1], elev=8, azim=200)
#plt.cla()
#ax.scatter(X_iris[:, 3], X_iris[:, 0], X_iris[:, 2], c=labels)

# moon
# np.random.seed(0)
# X, y = datasets.make_moons(2000, noise=0.2)

# blob
# np.random.seed(0)
# X, y = datasets.make_blobs(n_samples=2000, centers=3, n_features=20, random_state=0)

# centers = kmeans(X, k=3)
# labels = [find_closest_centroid(p, centers) for p in X]

# fig = plt.figure(1, figsize=(8, 8))
# plt.clf()
# plt.scatter(X[:,0], X[:,1], s=40, c=labels, cmap=plt.cm.Spectral)

#ax.w_xaxis.set_ticklabels([])
#ax.w_yaxis.set_ticklabels([])
#ax.w_zaxis.set_ticklabels([])
#ax.set_xlabel('Petal width')
#ax.set_ylabel('Sepal length')
#ax.set_zlabel('Petal length')

#plt.show()

# Here we use sci-kit learn implementation of K-means
centers2 =cluster.KMeans(n_clusters=3)
centers2.fit(X_iris) 
labels2 = centers2.labels_
/opt/conda/lib/python3.5/site-packages/sklearn/utils/fixes.py:64: DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() instead
  if 'order' in inspect.getargspec(np.copy)[0]:
[ 0.          4.95479566  0.72801099]
[ 4.95479566  0.          5.10294033]
[ 0.72801099  5.10294033  0.        ]
[ 0.83066239  4.9132474   1.55563492]
[ 2.98998328  2.29564806  2.98831056]
[ 2.98496231  2.14009346  3.03644529]
[ 2.45153013  2.83901391  2.38746728]
[ 4.2296572   1.08627805  4.37492857]
[ 2.98998328  2.22934968  3.02489669]
[ 2.9240383   2.35372046  2.94957624]
[ 3.0757113   2.59036677  2.94788059]
[ 3.79868398  1.57480157  3.87298335]
[ 4.19285106  1.33790882  4.22018957]
[ 0.36055513  5.0019996   0.37416574]
[ 0.24494897  5.02294734  0.75498344]
[ 4.13521463  1.62172747  4.14366987]
[ 1.04880885  5.48543526  0.55677644]
[ 3.10322413  2.38327506  3.01330383]
[ 0.34641016  5.03487835  0.75498344]
[ 0.42426407  5.19903837  0.5       ]
[ 0.50990195  5.12347538  0.26457513]
[ 0.57445626  4.74552421  0.52915026]
[ 5.4101756   0.50990195  5.54436651]
[ 0.78740079  5.08822169  0.26457513]
[ 0.36055513  5.07346036  1.00995049]
[ 4.50666174  1.19163753  4.64112055]
[ 4.11703777  0.9486833   4.24970587]
[ 4.42040722  1.03440804  4.48218697]
[ 3.24037035  1.83030052  3.318132  ]
[ 3.74966665  1.6643317   3.8457769 ]
[ 2.68514432  2.83196045  2.54950976]
[ 0.72801099  5.10294033  0.        ]
[ 4.05709256  1.161895    4.11946598]
[ 3.48568501  1.54272486  3.63593179]
[ 3.13528308  2.06397674  3.20312348]
[ 0.3         5.09705797  0.46904158]
[ 1.2083046   5.40092585  0.55677644]
[ 0.8660254   5.66745093  0.77459667]
[ 5.64092191  0.91104336  5.81807528]
[ 0.65574385  5.29150262  0.34641016]
[ 4.05092582  1.42126704  4.07185461]
[ 4.53321078  1.04880885  4.66690476]
[ 3.48998567  1.64012195  3.57071421]
[ 0.46904158  4.62493243  1.1       ]
[ 3.39705755  1.81383571  3.54259792]
[ 1.28840987  5.72276157  0.72801099]
[ 2.79821372  2.73861279  2.67207784]
[ 2.83019434  2.41246762  2.81780056]
[ 0.31622777  5.07050293  0.5       ]
[ 2.84604989  2.58263431  2.75862284]
[ 4.94469413  0.73484692  4.9979996 ]
[ 0.1         4.91731634  0.78740079]
[ 2.86530976  2.78208555  2.77488739]
[ 4.13400532  1.81659021  4.10609303]
[ 3.26496554  2.2737634   3.19843712]
[ 1.15758369  5.48178803  0.55677644]
[ 4.08411557  1.65227116  4.02864742]
[ 3.65513338  1.4525839   3.80657326]
[ 4.52658812  0.78740079  4.63249393]
[ 0.36055513  4.72228758  1.00995049]
[ 0.81240384  5.15848815  0.17320508]
[ 0.33166248  5.04777179  0.4472136 ]
[ 0.37416574  4.70850295  0.6244998 ]
[ 0.76811457  4.94368284  0.2       ]
[ 0.59160798  5.05173238  0.34641016]
[ 0.33166248  5.14975728  0.52915026]
[ 4.34281015  1.00498756  4.45757782]
[ 3.04959014  2.21585198  3.06757233]
[ 0.78740079  5.0368641   0.17320508]
[ 2.95127091  2.34946802  2.9189039 ]
[ 0.72801099  5.10294033  0.        ]
[ 0.87177979  5.23163454  0.17320508]
[ 4.34281015  1.7691806   4.3760713 ]
[ 4.73708771  0.98488578  4.81144469]
[ 3.68103246  1.40712473  3.77094153]
[ 4.66261729  0.71414284  4.81767579]
[ 5.16720427  0.51961524  5.31130869]
[ 0.54772256  5.0645829   1.26095202]
[ 6.06712452  1.43527001  6.30713881]
[ 5.16720427  1.32287566  5.29244745]
[ 6.07042008  1.12249722  6.21128006]
[ 4.48330235  1.20415946  4.49332839]
[ 0.24494897  4.96487663  0.74161985]
[ 0.70710678  4.80520551  0.51961524]
[ 3.96232255  1.88944436  3.96862697]
[ 1.62788206  5.45160527  0.93808315]
[ 0.8660254   5.30282943  0.34641016]
[ 3.77491722  1.21655251  3.92937654]
[ 2.94108823  2.30217289  2.92232784]
[ 4.98598034  0.79372539  5.14684369]
[ 3.3436507   2.0174241   3.34215499]
[ 3.90512484  1.43527001  3.98748041]
[ 5.83266663  1.28452326  6.09179776]
[ 5.16430053  0.34641016  5.33291665]
[ 5.12542681  0.83666003  5.2744668 ]
[ 2.98998328  2.54361947  2.9       ]
[ 3.96106046  1.37113092  3.97994975]
[ 3.09677251  2.30217289  3.06267857]
[ 3.58468967  2.69072481  3.4525353 ]
[ 0.80622577  5.33853913  0.31622777]
[ 4.58366665  0.9486833   4.65832588]
[ 4.01497198  1.32287566  4.07062649]
[ 4.52990066  1.3453624   4.67225855]
[ 5.0049975   1.07703296  5.16526863]
[ 0.87177979  5.17783739  0.26457513]
[ 2.39374184  3.50142828  2.13072758]
[ 0.2236068   4.96990946  0.50990195]
[ 0.42426407  5.0447993   0.33166248]
[ 2.13072758  3.57071421  1.91049732]
[ 2.76043475  3.38969025  2.45560583]
[ 4.89591667  0.96953597  5.03388518]
[ 4.94165964  0.92195445  5.04678115]
[ 3.70809924  1.82756669  3.66878727]
[ 0.65574385  5.10685813  1.3114877 ]
[ 3.54682957  1.83030052  3.52845575]
[ 3.46842904  1.53622915  3.60555128]
[ 0.92195445  5.21727898  0.31622777]
[ 0.5         4.65832588  0.88317609]
[ 4.72757866  0.70710678  4.86004115]
[ 4.13521463  1.62172747  4.14366987]
[ 3.48855271  1.50332964  3.65239647]
[ 2.43515913  3.45543051  2.16794834]
[ 5.49909083  0.55677644  5.63293884]
[ 3.84967531  1.15325626  4.04351332]
[ 5.48361195  1.12249722  5.69736781]
[ 3.6         1.57162336  3.66196668]
[ 4.76864761  1.28452326  4.91426495]
[ 0.77459667  5.28866713  0.47958315]
[ 3.35708207  2.06397674  3.38526218]
[ 0.42426407  5.09411425  1.04403065]
[ 4.02119385  0.9591663   4.18568991]
[ 3.62629287  1.37840488  3.73898382]
[ 3.88200979  1.44222051  3.94461658]
[ 2.51992063  2.73861279  2.52586619]
[ 4.77807493  1.03923048  4.85386444]
[ 3.43802269  1.77482393  3.49428104]
[ 4.04598566  1.48996644  4.10609303]
[ 4.50444225  1.03923048  4.66904701]
[ 5.94978991  1.02469508  6.10819122]
[ 3.92300905  1.11355287  4.0348482 ]
[ 0.54772256  4.87134478  0.65574385]
[ 3.3015148   1.72916165  3.40147027]
[ 4.48664685  0.88317609  4.59782557]
[ 0.47958315  4.90713766  0.4472136 ]
[ 6.36710295  1.4525839   6.48768063]
[ 0.28284271  4.982971    0.7       ]
[ 0.37416574  4.83011387  0.65574385]
[ 3.56230263  1.61864141  3.60277671]
[ 4.83425279  0.77459667  4.99199359]
[ 3.34514574  2.22261108  3.34962684]
[ 0.14411107  4.08076224  0.96873709]
[ 4.85414957  1.14180531  4.67825807]
[ 0.7367279   4.1702748   0.54306735]
[ 0.83640182  4.12129082  1.70403805]
[ 2.87429435  1.23523571  2.4999256 ]
[ 2.87227575  1.16495749  2.57925749]
[ 2.34204355  1.85827068  1.91501308]
[ 4.1130485   0.42086334  3.9150064 ]
[ 2.87749335  1.19948918  2.5489753 ]
[ 2.80719932  1.28951163  2.46210905]
[ 2.96161578  1.54609917  2.44063247]
[ 3.67928906  0.44817654  3.38859399]
[ 4.07568007  0.45074373  3.73602239]
[ 0.36710761  4.09067347  0.67839241]
[ 0.24488365  4.12157078  0.96234018]
[ 4.01516724  0.6236058   3.63968063]
[ 1.05828541  4.50434279  0.83714067]
[ 3.00206063  1.50442009  2.55853447]
[ 0.40394059  4.17747553  1.05478396]
[ 0.44177822  4.27364587  0.83326701]
[ 0.52380149  4.19841475  0.66570071]
[ 0.47704088  3.79982426  0.46612776]
[ 5.30633282  1.4962167   5.10958544]
[ 0.77250761  4.12881653  0.47735031]
[ 0.37890368  4.20598535  1.23630561]
[ 4.3856548   0.68675428  4.16677105]
[ 4.00703981  0.35608465  3.8059835 ]
[ 4.3053418   0.37428997  4.00502388]
[ 3.13049006  0.87548079  2.86709195]
[ 3.62987162  0.59824341  3.36376607]
[ 2.57751198  1.82870416  2.05774972]
[ 0.7367279   4.1702748   0.54306735]
[ 3.94795745  0.27110805  3.65500202]
[ 3.37454708  0.65932418  3.19736727]
[ 3.01654902  0.99260154  2.72627387]
[ 0.33877426  4.19046335  0.80599505]
[ 1.20431225  4.41220642  0.73619357]
[ 0.91453157  4.71932951  1.16527393]
[ 5.53382038  1.80051416  5.38857002]
[ 0.68554212  4.36539648  0.79866346]
[ 3.93766022  0.44150668  3.58611437]
[ 4.41383824  0.68531265  4.20219604]
[ 3.37851565  0.61882964  3.10818347]
[ 0.44087186  3.8024406   1.19959985]
[ 3.28023902  0.82507716  3.08405459]
[ 1.31299962  4.74025137  1.04722818]
[ 2.68781845  1.71719955  2.17409449]
[ 2.71697773  1.39661467  2.33972945]
[ 0.31743976  4.15492355  0.78529412]
[ 2.7358304   1.56706649  2.26798273]
[ 4.83598677  0.92841861  4.53226133]
[ 0.17767386  4.0559527   1.0176003 ]
[ 2.74393294  1.67587426  2.25079955]
[ 4.01152938  0.82327717  3.59201464]
[ 3.15416677  1.23878912  2.70091135]
[ 1.16205336  4.49599145  0.80708905]
[ 3.9737096   0.83210617  3.54038612]
[ 3.54219819  0.53850078  3.3583406 ]
[ 4.41528799  0.48988069  4.16952301]
[ 0.28769428  3.85423766  1.0907007 ]
[ 0.81189162  4.21495045  0.56743367]
[ 0.36161858  4.14855789  0.78229213]
[ 0.34171333  3.83280872  0.7496539 ]
[ 0.74804278  4.00828601  0.41185923]
[ 0.58066169  4.10857915  0.57821614]
[ 0.37251577  4.23667719  0.85280425]
[ 4.22638948  0.35095571  3.99038312]
[ 2.93447917  1.15524823  2.5825623 ]
[ 0.76991428  4.08383097  0.41257274]
[ 2.83936049  1.33572412  2.4382211 ]
[ 0.7367279   4.1702748   0.54306735]
[ 0.88451569  4.28768805  0.61470588]
[ 4.21501696  0.87948825  3.86487711]
[ 4.62029956  0.6768419   4.32954064]
[ 3.5712698   0.53265313  3.32365565]
[ 4.54928214  0.76753109  4.36849467]
[ 5.05734792  1.19324316  4.86308832]
[ 0.60130525  4.24565513  1.49023639]
[ 5.96526345  2.35507043  5.89228647]
[ 5.04713463  1.28699523  4.8094495 ]
[ 5.96861525  2.15028355  5.77908845]
[ 4.37757558  0.76466226  4.01931862]
[ 0.26223653  4.0724867   0.9403903 ]
[ 0.66450583  3.86057626  0.47980857]
[ 3.83812037  0.8253435   3.45605757]
[ 1.61355756  4.46121067  0.99627767]
[ 0.8696942   4.33665499  0.66348795]
[ 3.66807415  0.56915651  3.50288685]
[ 2.8354132   1.33646434  2.45576963]
[ 4.8710541   1.06166188  4.69131661]
[ 3.23121773  0.96469811  2.85496138]
[ 3.78718471  0.31654838  3.50783725]
[ 5.73494272  2.25215028  5.69588952]
[ 5.0610244   1.30694872  4.90308232]
[ 5.01058559  1.15386813  4.81268958]
[ 2.87415518  1.47837326  2.39292679]
[ 3.85128135  0.49833196  3.51060307]
[ 2.98173909  1.23341063  2.56576644]
[ 3.46323086  1.61135552  2.91863865]
[ 0.81998049  4.38217977  0.71552845]
/opt/conda/lib/python3.5/site-packages/ipykernel/__main__.py:25: DeprecationWarning: converting an array with ndim > 0 to an index will result in an error in the future
[ 4.47137205  0.56333448  4.18346659]
[ 3.89871363  0.27151309  3.59373373]
[ 4.40871501  0.84364764  4.19554159]
[ 4.88593573  1.13369212  4.69927201]
[ 0.85695274  4.21557612  0.54035263]
[ 2.28463739  2.44346732  1.60337285]
[ 0.245699    4.07422677  0.78229213]
[ 0.42891491  4.12255717  0.65050553]
[ 2.01434059  2.5167755   1.38231852]
[ 2.65721057  2.36659087  1.94926346]
[ 4.77723435  0.95156587  4.56523742]
[ 4.82526352  0.90401097  4.57045291]
[ 3.5969387   0.93947882  3.194054  ]
[ 0.71747334  4.31536797  1.57860095]
[ 3.44046043  0.99010745  3.07153608]
[ 3.36136401  0.76889003  3.18048902]
[ 0.91453157  4.247376    0.55749843]
[ 0.40640866  3.74640995  0.85693285]
[ 4.61358516  0.73796916  4.40196264]
[ 4.01516724  0.6236058   3.63968063]
[ 3.38295256  0.82274308  3.23840626]
[ 2.3271373   2.40854354  1.64716649]
[ 5.39673679  1.57070874  5.1970222 ]
[ 3.74912897  0.85000622  3.64541353]
[ 5.37121662  1.69808703  5.25678092]
[ 3.4891787   0.54458838  3.19552699]
[ 4.64792083  0.99784637  4.43816123]
[ 0.76939457  4.32031804  0.73258914]
[ 3.239069    0.9666895   2.89089521]
[ 0.49312067  4.24044947  1.2857517 ]
[ 3.91739301  0.6345238   3.76761491]
[ 3.52243779  0.66529727  3.31124356]
[ 3.76419553  0.31931351  3.46515152]
[ 2.39824269  1.68796145  2.03489668]
[ 4.65973905  0.73588148  4.36943715]
[ 3.32264473  0.69241147  3.01841092]
[ 3.92820162  0.47576961  3.6168208 ]
[ 4.38674914  0.80764521  4.21923938]
[ 5.84588471  2.02557402  5.67379776]
[ 3.81968166  0.61410581  3.60851669]
[ 0.44448622  3.91636249  0.65680494]
[ 3.19398936  0.86181805  2.96552422]
[ 4.37478777  0.49943332  4.13310648]
[ 0.41324085  3.96528682  0.55006291]
[ 6.26117944  2.37932593  6.04154088]
[ 0.22620345  4.06732958  0.87089124]
[ 0.31490951  3.9363458   0.80343661]
[ 3.45652542  0.69439247  3.14365539]
[ 4.72105581  0.91193829  4.53595879]
[ 3.2264482   1.13058607  2.84660455]
[ 0.12269274  4.0004793   0.95590795]
[ 4.98925242  1.21975119  4.89470735]
[ 0.66193425  4.08425346  0.39668627]
[ 0.89839995  4.05052482  1.7347507 ]
[ 2.99192333  1.14595359  2.71325635]
[ 2.99635728  1.07996974  2.79595422]
[ 2.45038035  1.7697932   2.12672518]
[ 4.24672917  0.48053925  4.13199226]
[ 2.99871554  1.11387146  2.76249163]
[ 2.92661127  1.20324269  2.67502523]
[ 3.06344719  1.4593382   2.64245341]
[ 3.80583287  0.3866001   3.60307646]
[ 4.19916474  0.44309664  3.94950123]
[ 0.29045456  4.00713962  0.62206109]
[ 0.19621871  4.04102416  0.93560675]
[ 4.13520025  0.59815798  3.84928045]
[ 0.95405761  4.41763526  0.63510629]
[ 3.10931056  1.42342063  2.76256403]
[ 0.36561482  4.09706211  1.02925216]
[ 0.3160397   4.19045061  0.74037828]
[ 0.42655604  4.11388417  0.55871281]
[ 0.49502879  3.71549139  0.50809448]
[ 5.4407991   1.57925393  5.32636461]
[ 0.70604659  4.04257051  0.31394267]
[ 0.37076583  4.12806468  1.22137627]
[ 4.51765331  0.74991753  4.38202693]
[ 4.14049202  0.40010578  4.02369979]
[ 4.43270195  0.43670314  4.22006635]
[ 3.25796928  0.79364747  3.08404929]
[ 3.75736771  0.55140091  3.57717207]
[ 2.6771482   1.7397561   2.26092017]
[ 0.66193425  4.08425346  0.39668627]
[ 4.07514143  0.25795924  3.86990439]
[ 3.50854401  0.60473931  3.41510761]
[ 3.14174645  0.90719419  2.94233921]
[ 0.21707711  4.10765054  0.73740084]
[ 1.11720945  4.32414265  0.52322079]
[ 0.78094884  4.6354478   1.0038725 ]
[ 5.67108988  1.88532746  5.60629646]
[ 0.57315075  4.28051317  0.66072687]
[ 4.060179    0.40476491  3.79817851]
[ 4.54648994  0.75282887  4.41884148]
[ 3.50633197  0.53723952  3.32423826]
[ 0.55449734  3.72596202  1.25990476]
[ 3.41192913  0.76708407  3.29899379]
[ 1.19822501  4.65262932  0.83051791]
[ 2.78918201  1.62829194  2.37784777]
[ 2.8334611   1.30754272  2.55385199]
[ 0.20649386  4.07212082  0.72564454]
[ 2.84287707  1.47799176  2.47567365]
[ 4.96347539  0.99847866  4.74634175]
[ 0.20227599  3.9762945   1.0161496 ]
[ 2.84681644  1.58931368  2.45254154]
[ 4.12714543  0.78943628  3.79849444]
[ 3.26431358  1.15482739  2.90643424]
[ 1.06344515  4.40838232  0.59208108]
[ 4.08832042  0.79062294  3.74723365]
[ 3.67600116  0.49925074  3.57527062]
[ 4.54664162  0.57561674  4.38558548]
[ 0.40626778  3.77647339  1.1358521 ]
[ 0.7328864   4.12849363  0.40516663]
[ 0.27079396  4.06559257  0.72536887]
[ 0.39464308  3.75084452  0.78902471]
[ 0.70482455  3.92198415  0.31962478]
[ 0.51349216  4.02381986  0.47974993]
[ 0.24681081  4.15409954  0.77547405]
[ 4.35801502  0.42948764  4.20715581]
[ 3.05420316  1.06735716  2.79602575]
[ 0.71042852  3.99732739  0.24690079]
[ 2.95393689  1.2466126   2.65053957]
[ 0.66193425  4.08425346  0.39668627]
[ 0.79733504  4.20095243  0.42727041]
[ 4.33556484  0.88062552  4.07275828]
[ 4.74870782  0.75227519  4.5442007 ]
[ 3.70054307  0.47399153  3.54058752]
[ 4.68451798  0.85053981  4.58625773]
[ 5.19242337  1.28171485  5.08064563]
[ 0.62302244  4.17096627  1.48787096]
[ 6.10609236  2.44192574  6.10872818]
[ 5.17805899  1.36063758  5.02220669]
[ 6.10383303  2.23515502  5.99557837]
[ 4.49941204  0.78573297  4.22858842]
[ 0.24470614  3.99195562  0.92312513]
[ 0.65906286  3.77581134  0.47849765]
[ 3.95594948  0.78321004  3.66324446]
[ 1.54100898  4.37200865  0.84697107]
[ 0.7738518   4.25027956  0.46963816]
[ 3.80284173  0.55366323  3.72050534]
[ 2.95148444  1.24953377  2.66753819]
[ 5.00660444  1.14831455  4.90862099]
[ 3.34950615  0.87979712  3.06596804]
[ 3.91498691  0.26786434  3.7229773 ]
[ 5.87659458  2.33512162  5.9118322 ]
[ 5.1977666   1.39234561  5.12013281]
[ 5.14529096  1.24175667  5.02940951]
[ 2.98060716  1.38972466  2.59910754]
[ 3.97416897  0.45124971  3.72394415]
[ 3.0956396   1.14477129  2.77599712]
[ 3.56025257  1.54349483  3.10672818]
[ 0.71260938  4.29613213  0.52892343]
[ 4.5997247   0.63367418  4.39760844]
[ 4.02490759  0.23112328  3.80906288]
[ 4.54022843  0.89726509  4.40873678]
[ 5.02046348  1.21410171  4.9151765 ]
[ 0.77895934  4.12877116  0.35799441]
[ 2.35889039  2.35444643  1.78341246]
[ 0.17286149  3.99187733  0.7497733 ]
[ 0.34424088  4.03857458  0.57285251]
[ 2.09298131  2.42737883  1.57434431]
[ 2.72862572  2.27982118  2.11966035]
[ 4.91053249  1.03334068  4.78160642]
[ 4.95627972  0.98714722  4.7857455 ]
[ 3.71235845  0.8736187   3.40313973]
[ 0.7298217   4.24100534  1.57421727]
[ 3.55797575  0.92041728  3.28197502]
[ 3.49426403  0.71667726  3.39784638]
[ 0.83431888  4.16043884  0.3517954 ]
[ 0.49153355  3.6660721   0.91463654]
[ 4.74721897  0.82505129  4.61932463]
[ 4.13520025  0.59815798  3.84928045]
[ 3.51786844  0.78201213  3.45551154]
[ 2.40220123  2.31941113  1.82837633]
[ 5.53121721  1.65494249  5.41333169]
[ 3.88630022  0.85396212  3.8616266 ]
[ 5.51007401  1.78403839  5.47362403]
[ 3.6154675   0.46152064  3.41082981]
[ 4.78004672  1.06393592  4.65164057]
[ 0.67355341  4.23508181  0.57979307]
[ 3.35911807  0.88734227  3.10157379]
[ 0.4819698   4.16301689  1.26402532]
[ 4.05303867  0.65660209  3.98484128]
[ 3.65356015  0.61954389  3.52779818]
[ 3.89060119  0.24891692  3.68061951]
[ 2.51414311  1.60028684  2.24974665]
[ 4.78808345  0.80950168  4.5839459 ]
[ 3.44782066  0.60404992  3.2339697 ]
[ 4.0532939  0.4560716  3.82912  ]
[ 4.52162067  0.86979479  4.43627772]
[ 5.98237801  2.11297609  5.89102368]
[ 3.95093421  0.60577193  3.82473006]
[ 0.44398477  3.83322662  0.65981816]
[ 3.32356555  0.7899639   3.18260271]
[ 4.50618943  0.57886495  4.34867336]
[ 0.38266541  3.88124245  0.52283841]
[ 6.39513973  2.46430375  6.25793576]
[ 0.1853749   3.98602784  0.84956459]
[ 0.33613173  3.8544997   0.81789975]
[ 3.58063061  0.61920753  3.35734419]
[ 4.85623152  0.99826999  4.7526582 ]
[ 3.34244672  1.0550875   3.05322125]
[ 0.13287421  4.0004793   0.97431422]
[ 4.98052764  1.21975119  4.90196779]
[ 0.65191172  4.08425346  0.40429551]
[ 0.90902267  4.05052482  1.75455261]
[ 2.97976323  1.14595359  2.71801794]
[ 2.98501629  1.07996974  2.80269897]
[ 2.43741439  1.7697932   2.13114715]
[ 4.23693941  0.48053925  4.1398617 ]
[ 2.98736041  1.11387146  2.767843  ]
[ 2.91484515  1.20324269  2.68035225]
[ 3.04964297  1.4593382   2.6437262 ]
[ 3.79477565  0.3866001   3.60914877]
[ 4.18748002  0.44309664  3.95444326]
[ 0.27986108  4.00713962  0.63910473]
[ 0.20080394  4.04102416  0.95356604]
[ 4.12322554  0.59815798  3.85315302]
[ 0.94709497  4.41763526  0.6325516 ]
[ 3.0967169   1.42342063  2.76513198]
[ 0.36921839  4.09706211  1.04687863]
[ 0.31143039  4.19045061  0.75616678]
[ 0.41751913  4.11388417  0.57310982]
[ 0.47852784  3.71549139  0.52926508]
[ 5.43166539  1.57925393  5.33331868]
[ 0.69497402  4.04257051  0.32034595]
[ 0.38164847  4.12806468  1.24068054]
[ 4.50751102  0.74991753  4.38948989]
[ 4.13081778  0.40010578  4.03135067]
[ 4.4217254   0.43670314  4.2255321 ]
[ 3.24730487  0.79364747  3.09086636]
[ 3.74668594  0.55140091  3.58366035]
[ 2.66351689  1.7397561   2.26240024]
[ 0.65191172  4.08425346  0.40429551]
[ 4.06451993  0.25795924  3.87525761]
[ 3.49899827  0.60473931  3.42358801]
[ 3.13044015  0.90719419  2.94903174]
[ 0.21210584  4.10765054  0.75395946]
[ 1.10739434  4.32414265  0.51165242]
[ 0.78251021  4.6354478   1.01124751]
[ 5.662419    1.88532746  5.61464943]
[ 0.56714686  4.28051317  0.67153173]
[ 4.04874329  0.40476491  3.80231792]
[ 4.53633357  0.75282887  4.42663772]
[ 3.49575775  0.53723952  3.33055874]
[ 0.55766378  3.72596202  1.27969587]
[ 3.40210556  0.76708407  3.30708152]
[ 1.19233198  4.65262932  0.8231777 ]
[ 2.77554599  1.62829194  2.37945124]
[ 2.821168    1.30754272  2.55883206]
[ 0.19745604  4.07212082  0.74338518]
[ 2.82978012  1.47799176  2.47826314]
[ 4.95280953  0.99847866  4.751153  ]
[ 0.21052844  3.9762945   1.03486949]
[ 2.83319411  1.58931368  2.45478204]
[ 4.11448525  0.78943628  3.80155076]
[ 3.25171579  1.15482739  2.90848555]
[ 1.05498289  4.40838232  0.58462654]
[ 4.07590344  0.79062294  3.74946061]
[ 3.66646818  0.49925074  3.58342781]
[ 4.53662748  0.57561674  4.39186235]
[ 0.40864274  3.77647339  1.15583514]
[ 0.72225726  4.12849363  0.41144647]
[ 0.26392339  4.06559257  0.74226334]
[ 0.38599079  3.75084452  0.807334  ]
[ 0.69209023  3.92198415  0.33058765]
[ 0.50364229  4.02381986  0.49256627]
[ 0.24560583  4.15409954  0.79117309]
[ 4.34775677  0.42948764  4.21417705]
[ 3.04242045  1.06735716  2.80106317]
[ 0.69832339  3.99732739  0.2551892 ]
[ 2.941483    1.2466126   2.65473568]
[ 0.65191172  4.08425346  0.40429551]
[ 0.78802848  4.20095243  0.42831631]
[ 4.32353893  0.88062552  4.07759588]
[ 4.73789569  0.75227519  4.54973862]
[ 3.69012044  0.47399153  3.54755148]
[ 4.67507457  0.85053981  4.59430316]
[ 5.18308038  1.28171485  5.08798796]
[ 0.63691095  4.17096627  1.50641568]
[ 6.09893342  2.44192574  6.11787993]
[ 5.1680095   1.36063758  5.0286799 ]
[ 6.09499704  2.23515502  6.00237077]
[ 4.4883912   0.78573297  4.23173978]
[ 0.24830537  3.99195562  0.94036599]
[ 0.64780827  3.77581134  0.49002197]
[ 3.94372779  0.78321004  3.6672135 ]
[ 1.52959326  4.37200865  0.83273937]
[ 0.76528136  4.25027956  0.47182786]
[ 3.79363356  0.55366323  3.72895716]
[ 2.93966929  1.24953377  2.67163275]
[ 4.99723146  1.14831455  4.91656603]
[ 3.33771212  0.87979712  3.06990036]
[ 3.90414168  0.26786434  3.72918063]
[ 5.86995079  2.33512162  5.92165136]
[ 5.18925064  1.39234561  5.12779565]
[ 5.13578837  1.24175667  5.03679344]
[ 2.96720782  1.38972466  2.60146527]
[ 3.96278381  0.45124971  3.72851018]
[ 3.08312432  1.14477129  2.77956019]
[ 3.54640507  1.54349483  3.10646447]
[ 0.70497439  4.29613213  0.53396772]
[ 4.5892253   0.63367418  4.40285379]
[ 4.01368354  0.23112328  3.81478984]
[ 4.53030413  0.89726509  4.4160829 ]
[ 5.01088704  1.21410171  4.92308726]
[ 0.76745612  4.12877116  0.36302644]
[ 2.34371263  2.35444643  1.78095149]
[ 0.16428295  3.99187733  0.76765109]
[ 0.334149    4.03857458  0.58888725]
[ 2.07757604  2.42737883  1.57510471]
[ 2.71335995  2.27982118  2.1154483 ]
[ 4.90057706  1.03334068  4.7890627 ]
[ 4.94600737  0.98714722  4.79193296]
[ 3.69995345  0.8736187   3.40675136]
[ 0.74228626  4.24100534  1.59299138]
[ 3.54612308  0.92041728  3.28622705]
[ 3.48467917  0.71667726  3.4062621 ]
[ 0.82400782  4.16043884  0.35017357]
[ 0.48544367  3.6660721   0.93280305]
[ 4.73740318  0.82505129  4.62665699]
[ 4.12322554  0.59815798  3.85315302]
[ 3.50879688  0.78201213  3.46460025]
[ 2.38697624  2.31941113  1.82600334]
[ 5.52222681  1.65494249  5.41995894]
[ 3.87809948  0.85396212  3.87106207]
[ 5.50178658  1.78403839  5.48263515]
[ 3.60467135  0.46152064  3.41652185]
[ 4.77021896  1.06393592  4.658965  ]
[ 0.66607474  4.23508181  0.58747045]
[ 3.34748496  0.88734227  3.10633034]
[ 0.49496352  4.16301689  1.2806723 ]
[ 4.0442126   0.65660209  3.99334298]
[ 3.64385175  0.61954389  3.53519752]
[ 3.87938855  0.24891692  3.68659123]
[ 2.50153064  1.60028684  2.25594508]
[ 4.77720164  0.80950168  4.58958475]
[ 3.43661494  0.60404992  3.23987472]
[ 4.04223398  0.4560716   3.83429109]
[ 4.51201975  0.86979479  4.44504835]
[ 5.9736914   2.11297609  5.89831514]
[ 3.94127588  0.60577193  3.83200838]
[ 0.43010335  3.83322662  0.67954999]
[ 3.31335513  0.7899639   3.19010787]
[ 4.49625647  0.57886495  4.35499195]
[ 0.36740834  3.88124245  0.54248336]
[ 6.38573845  2.46430375  6.26426012]
[ 0.17978382  3.98602784  0.86849767]
[ 0.32607089  3.8544997   0.837728  ]
[ 3.5698257   0.61920753  3.36230896]
[ 4.84695666  0.99826999  4.76026486]
[ 3.33051381  1.0550875   3.05711545]
[ 0.13287421  4.0004793   0.97431422]
[ 4.98052764  1.21975119  4.90196779]
[ 0.65191172  4.08425346  0.40429551]
[ 0.90902267  4.05052482  1.75455261]
[ 2.97976323  1.14595359  2.71801794]
[ 2.98501629  1.07996974  2.80269897]
[ 2.43741439  1.7697932   2.13114715]
[ 4.23693941  0.48053925  4.1398617 ]
[ 2.98736041  1.11387146  2.767843  ]
[ 2.91484515  1.20324269  2.68035225]
[ 3.04964297  1.4593382   2.6437262 ]
[ 3.79477565  0.3866001   3.60914877]
[ 4.18748002  0.44309664  3.95444326]
[ 0.27986108  4.00713962  0.63910473]
[ 0.20080394  4.04102416  0.95356604]
[ 4.12322554  0.59815798  3.85315302]
[ 0.94709497  4.41763526  0.6325516 ]
[ 3.0967169   1.42342063  2.76513198]
[ 0.36921839  4.09706211  1.04687863]
[ 0.31143039  4.19045061  0.75616678]
[ 0.41751913  4.11388417  0.57310982]
[ 0.47852784  3.71549139  0.52926508]
[ 5.43166539  1.57925393  5.33331868]
[ 0.69497402  4.04257051  0.32034595]
[ 0.38164847  4.12806468  1.24068054]
[ 4.50751102  0.74991753  4.38948989]
[ 4.13081778  0.40010578  4.03135067]
[ 4.4217254   0.43670314  4.2255321 ]
[ 3.24730487  0.79364747  3.09086636]
[ 3.74668594  0.55140091  3.58366035]
[ 2.66351689  1.7397561   2.26240024]
[ 0.65191172  4.08425346  0.40429551]
[ 4.06451993  0.25795924  3.87525761]
[ 3.49899827  0.60473931  3.42358801]
[ 3.13044015  0.90719419  2.94903174]
[ 0.21210584  4.10765054  0.75395946]
[ 1.10739434  4.32414265  0.51165242]
[ 0.78251021  4.6354478   1.01124751]
[ 5.662419    1.88532746  5.61464943]
[ 0.56714686  4.28051317  0.67153173]
[ 4.04874329  0.40476491  3.80231792]
[ 4.53633357  0.75282887  4.42663772]
[ 3.49575775  0.53723952  3.33055874]
[ 0.55766378  3.72596202  1.27969587]
[ 3.40210556  0.76708407  3.30708152]
[ 1.19233198  4.65262932  0.8231777 ]
[ 2.77554599  1.62829194  2.37945124]
[ 2.821168    1.30754272  2.55883206]
[ 0.19745604  4.07212082  0.74338518]
[ 2.82978012  1.47799176  2.47826314]
[ 4.95280953  0.99847866  4.751153  ]
[ 0.21052844  3.9762945   1.03486949]
[ 2.83319411  1.58931368  2.45478204]
[ 4.11448525  0.78943628  3.80155076]
[ 3.25171579  1.15482739  2.90848555]
[ 1.05498289  4.40838232  0.58462654]
[ 4.07590344  0.79062294  3.74946061]
[ 3.66646818  0.49925074  3.58342781]
[ 4.53662748  0.57561674  4.39186235]
[ 0.40864274  3.77647339  1.15583514]
[ 0.72225726  4.12849363  0.41144647]
[ 0.26392339  4.06559257  0.74226334]
[ 0.38599079  3.75084452  0.807334  ]
[ 0.69209023  3.92198415  0.33058765]
[ 0.50364229  4.02381986  0.49256627]
[ 0.24560583  4.15409954  0.79117309]
[ 4.34775677  0.42948764  4.21417705]
[ 3.04242045  1.06735716  2.80106317]
[ 0.69832339  3.99732739  0.2551892 ]
[ 2.941483    1.2466126   2.65473568]
[ 0.65191172  4.08425346  0.40429551]
[ 0.78802848  4.20095243  0.42831631]
[ 4.32353893  0.88062552  4.07759588]
[ 4.73789569  0.75227519  4.54973862]
[ 3.69012044  0.47399153  3.54755148]
[ 4.67507457  0.85053981  4.59430316]
[ 5.18308038  1.28171485  5.08798796]
[ 0.63691095  4.17096627  1.50641568]
[ 6.09893342  2.44192574  6.11787993]
[ 5.1680095   1.36063758  5.0286799 ]
[ 6.09499704  2.23515502  6.00237077]
[ 4.4883912   0.78573297  4.23173978]
[ 0.24830537  3.99195562  0.94036599]
[ 0.64780827  3.77581134  0.49002197]
[ 3.94372779  0.78321004  3.6672135 ]
[ 1.52959326  4.37200865  0.83273937]
[ 0.76528136  4.25027956  0.47182786]
[ 3.79363356  0.55366323  3.72895716]
[ 2.93966929  1.24953377  2.67163275]
[ 4.99723146  1.14831455  4.91656603]
[ 3.33771212  0.87979712  3.06990036]
[ 3.90414168  0.26786434  3.72918063]
[ 5.86995079  2.33512162  5.92165136]
[ 5.18925064  1.39234561  5.12779565]
[ 5.13578837  1.24175667  5.03679344]
[ 2.96720782  1.38972466  2.60146527]
[ 3.96278381  0.45124971  3.72851018]
[ 3.08312432  1.14477129  2.77956019]
[ 3.54640507  1.54349483  3.10646447]
[ 0.70497439  4.29613213  0.53396772]
[ 4.5892253   0.63367418  4.40285379]
[ 4.01368354  0.23112328  3.81478984]
[ 4.53030413  0.89726509  4.4160829 ]
[ 5.01088704  1.21410171  4.92308726]
[ 0.76745612  4.12877116  0.36302644]
[ 2.34371263  2.35444643  1.78095149]
[ 0.16428295  3.99187733  0.76765109]
[ 0.334149    4.03857458  0.58888725]
[ 2.07757604  2.42737883  1.57510471]
[ 2.71335995  2.27982118  2.1154483 ]
[ 4.90057706  1.03334068  4.7890627 ]
[ 4.94600737  0.98714722  4.79193296]
[ 3.69995345  0.8736187   3.40675136]
[ 0.74228626  4.24100534  1.59299138]
[ 3.54612308  0.92041728  3.28622705]
[ 3.48467917  0.71667726  3.4062621 ]
[ 0.82400782  4.16043884  0.35017357]
[ 0.48544367  3.6660721   0.93280305]
[ 4.73740318  0.82505129  4.62665699]
[ 4.12322554  0.59815798  3.85315302]
[ 3.50879688  0.78201213  3.46460025]
[ 2.38697624  2.31941113  1.82600334]
[ 5.52222681  1.65494249  5.41995894]
[ 3.87809948  0.85396212  3.87106207]
[ 5.50178658  1.78403839  5.48263515]
[ 3.60467135  0.46152064  3.41652185]
[ 4.77021896  1.06393592  4.658965  ]
[ 0.66607474  4.23508181  0.58747045]
[ 3.34748496  0.88734227  3.10633034]
[ 0.49496352  4.16301689  1.2806723 ]
[ 4.0442126   0.65660209  3.99334298]
[ 3.64385175  0.61954389  3.53519752]
[ 3.87938855  0.24891692  3.68659123]
[ 2.50153064  1.60028684  2.25594508]
[ 4.77720164  0.80950168  4.58958475]
[ 3.43661494  0.60404992  3.23987472]
[ 4.04223398  0.4560716   3.83429109]
[ 4.51201975  0.86979479  4.44504835]
[ 5.9736914   2.11297609  5.89831514]
[ 3.94127588  0.60577193  3.83200838]
[ 0.43010335  3.83322662  0.67954999]
[ 3.31335513  0.7899639   3.19010787]
[ 4.49625647  0.57886495  4.35499195]
[ 0.36740834  3.88124245  0.54248336]
[ 6.38573845  2.46430375  6.26426012]
[ 0.17978382  3.98602784  0.86849767]
[ 0.32607089  3.8544997   0.837728  ]
[ 3.5698257   0.61920753  3.36230896]
[ 4.84695666  0.99826999  4.76026486]
[ 3.33051381  1.0550875   3.05711545]
In [8]:
print('Centroids of self-built K-means: ', labels)
print('Centroids of Sci-kit learn K-means: ', labels2)
Centroids of self-built K-means:  [array([2]), array([1]), array([2]), array([0]), array([0]), array([2]), array([2]), array([2]), array([0]), array([0]), array([1]), array([2]), array([2]), array([2]), array([1]), array([0]), array([0]), array([2]), array([2]), array([1]), array([2]), array([2]), array([0]), array([1]), array([1]), array([0]), array([0]), array([2]), array([2]), array([0]), array([1]), array([0]), array([1]), array([0]), array([0]), array([2]), array([2]), array([0]), array([2]), array([2]), array([2]), array([2]), array([2]), array([2]), array([2]), array([1]), array([1]), array([2]), array([2]), array([2]), array([2]), array([1]), array([0]), array([1]), array([2]), array([0]), array([1]), array([0]), array([2]), array([2]), array([2]), array([2]), array([2]), array([0]), array([2]), array([1]), array([2]), array([0]), array([1]), array([2]), array([2]), array([2]), array([2]), array([2]), array([0]), array([0]), array([0]), array([1]), array([0]), array([0]), array([0]), array([0]), array([0]), array([0]), array([2]), array([1]), array([0]), array([0]), array([0]), array([2]), array([2]), array([0]), array([1]), array([0]), array([0]), array([2]), array([1]), array([0]), array([0]), array([1]), array([1]), array([1]), array([2]), array([1]), array([2]), array([2]), array([0]), array([2]), array([0]), array([0]), array([2]), array([1]), array([0]), array([2]), array([2]), array([1]), array([0]), array([2]), array([2]), array([1]), array([2]), array([2]), array([1]), array([0]), array([1]), array([0]), array([1]), array([0]), array([1]), array([0]), array([2]), array([1]), array([1]), array([1]), array([1]), array([2]), array([1]), array([0]), array([0]), array([2]), array([0]), array([1]), array([1]), array([2]), array([2]), array([2]), array([0]), array([1]), array([0]), array([2])]
Centroids of Sci-kit learn K-means:  [2 1 2 0 0 2 2 2 0 0 1 2 2 2 1 0 0 2 2 1 2 2 0 1 1 0 0 2 2 0 1 0 1 0 0 2 2
 0 2 2 2 2 2 2 2 1 1 2 2 2 2 1 0 1 2 0 1 0 2 2 2 2 2 0 2 1 2 0 1 2 2 2 2 2
 0 0 0 1 0 0 0 0 0 0 2 1 0 0 0 2 2 0 1 0 0 2 1 0 0 1 1 1 2 1 2 2 0 2 0 0 2
 1 0 2 2 1 0 2 2 1 2 2 1 0 1 0 1 0 1 0 2 1 1 1 1 2 1 0 0 2 0 1 1 2 2 2 0 1
 0 2]
In [8]:
#plot the clusters in color
fig = plt.figure(1, figsize=(8, 8))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, 1, 1], elev=8, azim=200)
plt.cla()
ax.scatter(X_iris[:, 3], X_iris[:, 0], X_iris[:, 2], c=labels)
ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
plt.title('Clustering on Iris Dataset')
plt.show()
/opt/conda/lib/python3.5/site-packages/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self._edgecolors == str('face'):
PUT YOUR COMMENT HERE !!!

We can see that the data is a little bit clearly seperated to 3 clusters. But it seems that there is some difficulties to determine whether a point belongs to red cluster or blue cluster

We think that our algorithm work fine because we can see that: points belong to one cluster seem near points in the same cluster but far away enough from points from other clusters.

The greencluster is clearly seperated from red cluster and blue cluster.

To get more insight information, we would like to use PCA to calculate the 3 possible largest variances to form a 3D dimension. We expect that the new 3D will visualize better than the graph above

In [10]:
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
score = pca.fit_transform(X_iris)
PCA(copy=True, n_components=3, whiten=False)

#plot the clusters in color
fig = plt.figure(1, figsize=(8, 8))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, 1, 1], elev=8, azim=200)
plt.cla()
ax.scatter(score[:, 2], score[:, 1], score[:, 0], c=labels)

ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
plt.title('Clustering on Iris Dataset with PCA')
plt.show()
/opt/conda/lib/python3.5/site-packages/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self._edgecolors == str('face'):
PUT YOUR COMMENT HERE !!!

In this graph, it shows that the points spread more than the previous one. It visualize better in somehow (in our opinion)

That's enough about K-means for now. In the next section, we will apply MMLIB's K-means on Spark to deal with a large data in the real usecase.

2. Usecase: Network Intrusion

Some attacks attempt to flood a computer with network traffic. In some other cases, attacks attempt to exploit flaws in networking software in order to gain unauthorized access to a computer. Detecting an exploit in an incredibly large haystack of network requests is not easy.

Some exploit behaviors follow known patterns such as scanning every port in a short of time, sending a burst of request to a port... However, the biggest threat may be the one that has never been detected and classified yet. Part of detecting potential network intrusions is detecting anomalies. These are connections that aren't known to be attacks, but, do not resemble connections that have been observed in the past.

In this notebook, K-means is used to detect anomalous network connections based on statistics about each of them.

2.1. Data

The data comes from KDD Cup 1999. The dataset is about 708MB and contains about 4.9M connections. For each connection, the data set contains information like the number of bytes sent, login attempts, TCP errors, and so on. Each connection is one line of CSV-formatted data, containing 38 features: back, buffer_overflow, ftp_write, guess_passwd, imap, ipsweep, land, loadmodule, multihop, neptune, nmap, normal, perl, phf, pod, portsweep, rootkit, satan, smurf, spy, teardrop, warezclient, warezmaster. For more details about each feature, please follow this link.

Many features take on the value 0 or 1, indicating the presence or absence of a behavior such as su_attempted in the 15th column. Some features are counts, like num_file_creations in the 17th columns. Some others are the number of sent and received bytes.

2.2. Clustering without using categorical features

First, we need to import some packages that are used in this notebook.

In [8]:
import os
import sys
import re
from sklearn import datasets, cluster
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
from sklearn.decomposition import PCA
from pyspark import SparkContext
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql.functions import *
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import pyspark.sql.functions as func
import matplotlib.patches as mpatches
from pyspark.mllib.clustering import KMeans, KMeansModel

input_path = "/datasets/k-means/kddcup.data"
raw_data = sc.textFile(input_path, 12)

2.2.1. Loading data

There are two types of features: numerical features and categorical features. Currently, to get familiar with the data and the problem, we only use numerical features. In our data, we also have pre-defined groups for each connection, which we can use later as our "ground truth" for verifying our results.

Note 1: we don't use the labels in the training phase!!!

Note 2: in general, since clustering is un-supervised, you don't have access to ground truth. For this reason, several metrics to judge the quality of clustering have been devised. For a short overview of such metrics, follow this link. Note that computing such metrics, that is trying to assess the quality of your clustering results, is as computationally intensive as computing the clustering itself!

Question 2

Write function `parseLine` to construct a tuple of `(label, vector)` for each connection, extract the data that contains only the data points (without label), then print the number of connections.

Where,

  • label is the pre-defined label of each connection
  • vector is a numpy array that contains values of all features, but the label and the categorial features at index 1,2,3 of each connection. Each vector is a data point.
In [9]:
def parseLine(line):
    
    cols = line.split(",")
    # label is the last column
    label = cols[-1]
    
    # vector is every column, except the label
    vector = cols[:-1]
    
    # delete values of columns that have index 1->3 (categorical features)
    vector[1:4]=[]
    
    # convert each value from string to float
    vector = np.array(list(map(lambda x: float(x), vector)))
    return (label, vector)


labelsAndData = raw_data.map(lambda line: parseLine(line))

# we only need the data, not the labeldef parseLine(line):
data = labelsAndData.map(lambda x: x[1]).cache()

# number of connections
n = data.count()

print('number of connections: ', n)
number of connections:  4898431
In [14]:
print(data.take(1))
[array([  0.00000000e+00,   2.15000000e+02,   4.50760000e+04,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   1.00000000e+00,   1.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00])]
In [10]:
#Get all labels in the dataset
labels_data = labelsAndData.map(lambda x: x[0]).cache()
#Count the distribution of labels 
list_classes = []
count_classes = []
for key, value in sorted(labels_data.countByValue().items(), key = lambda x: -x[1]):
    list_classes.append(key)
    count_classes.append(value)
    print(key, value)
smurf. 2807886
neptune. 1072017
normal. 972781
satan. 15892
ipsweep. 12481
portsweep. 10413
nmap. 2316
back. 2203
warezclient. 1020
teardrop. 979
pod. 264
guess_passwd. 53
buffer_overflow. 30
land. 21
warezmaster. 20
imap. 12
rootkit. 10
loadmodule. 9
ftp_write. 8
multihop. 7
phf. 4
perl. 3
spy. 2
In [14]:
import plotly
import plotly.graph_objs as go
plotly.offline.init_notebook_mode()
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

fig = {
    'data': [{'labels': list_classes,
              'values': count_classes,
              'type': 'pie'}],
    'layout': {'title': 'Distribution of classes in dataset '}
     }

iplot(fig)

Data summarization:

We can see that the data has 4898431 records divided into 23 classes.

The top 3 largest classes: smurf (57.3%), neptune (21.9%), normal (19.9%)

The top 3 smallest classes: phf (8.2e-7%), perl (6.1e-7%), spy (4.1e-7%)

Hardly to say about the distribution of the dataset.

Question 3

Using K-means algorithm of MLLIB, cluster the connections into two groups then plot the result. Why two groups? In this case, we are just warming up, we're testing things around, so "two groups" has no particular meaning.

You can use the following parameters:

  • `maxIterations=10`
  • `runs=10`
  • `initializationMode="random"`

Discuss the result from your figure.

In [13]:
clusters = KMeans.train(data, 2, maxIterations=10, runs=10, initializationMode="random")
/opt/spark/python/pyspark/mllib/clustering.py:347: UserWarning:

The param `runs` has no effect since Spark 2.0.0.

We reuse the code of visualizing data in the previous part, then putting it in the function PlotData(data,rate,clusters)

In [11]:
def PlotData(data, rate, clusters):
    sampledData = data.sample(False, rate)

    clusterCentroids = sc.parallelize(clusters.centers)
    sampledDataCentroids = sampledData + clusterCentroids
    arraysampledDataCentroids = np.array(sampledDataCentroids.take(sampledDataCentroids.count()))
    print(arraysampledDataCentroids.shape)
    
    Y_labels = sampledDataCentroids.map(lambda x: clusters.predict(x))
    Y_labels = np.array(Y_labels.take(Y_labels.count()))
    
    pca = PCA(n_components=3)
    score = pca.fit_transform(arraysampledDataCentroids)
    PCA(copy=True, n_components=3, whiten=False)

    #plot the clusters in color
    fig = plt.figure(1, figsize=(8, 8))
    plt.clf()
    ax = Axes3D(fig, rect=[0, 0, 1, 1], elev=8, azim=200)
    plt.cla()
    ax.scatter(score[:,0],score[:,1], score[:,2], c=Y_labels)

    ax.w_xaxis.set_ticklabels([])
    ax.w_yaxis.set_ticklabels([])
    ax.w_zaxis.set_ticklabels([])
    ax.set_xlabel('X ')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')

    plt.show()
In [19]:
PlotData(data, 0.3, clusters)
(1469606, 38)
/opt/conda/lib/python3.5/site-packages/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self._edgecolors == str('face'):
PUT YOUR ANSWER HERE !!!

Note that: we only plot 30% samples on over whole dataset

Most of the points fall on top of one another (actually, in the straight line in Z axis), and the result is sparse and hard to interpret.

However, the dominant feature of the visualization is its "L" shape (we can see it on the XZ axes) . The points seem to vary along two distinct dimensions, and little in other dimensions.

This could be explained that the dataset has two features that are on a much larger scale than others. Whereas most features have values between 0 and 1, the bytes-sent and bytes-received features vary from 0 to ten of thousands. The Euclidean distance between points is therefore almost complextely determined by these two features. It's almost as if the other features don't exist.

So it is important to normalize away these differences in scale to put features on near-equal footing

2.2.3. Evaluating model

Question 4

One of the simplest method to evaluate our result is calculate the Within Set Sum of Squared Errors (WSSSE), or simply, 'Sum of Squared Errors'. An error of a data point is defined as it's distance to the closest cluster center.
In [14]:
from operator import add

# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(clusters, point):
    closest_center = clusters.centers[clusters.predict(point)]
    return euclidean_distance(point, closest_center)
In [15]:
WSSSE = data.map(lambda point: error(clusters, point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
Within Set Sum of Squared Error = 20839996286.8

Question 5

This is a good opportunity to use the given labels to get an intuitive sense of what went into these two clusters, by counting the labels within each cluster. Complete the following code that uses the model to assign each data point to a cluster, and counts occurrences of cluster and label pairs. What do you think about the result?
In [19]:
clusterLabelCount = sorted(labelsAndData.map(lambda x: (clusters.predict(x[1]), x[0])).countByValue().items(), key = lambda xx: -xx[1])
for item in clusterLabelCount:
    print(item)
((0, 'smurf.'), 2807886)
((0, 'neptune.'), 1072017)
((0, 'normal.'), 972781)
((0, 'satan.'), 15892)
((0, 'ipsweep.'), 12481)
((0, 'portsweep.'), 10408)
((0, 'nmap.'), 2316)
((0, 'back.'), 2203)
((0, 'warezclient.'), 1020)
((0, 'teardrop.'), 979)
((0, 'pod.'), 264)
((0, 'guess_passwd.'), 53)
((0, 'buffer_overflow.'), 30)
((0, 'land.'), 21)
((0, 'warezmaster.'), 20)
((0, 'imap.'), 12)
((0, 'rootkit.'), 10)
((0, 'loadmodule.'), 9)
((0, 'ftp_write.'), 8)
((0, 'multihop.'), 7)
((1, 'portsweep.'), 5)
((0, 'phf.'), 4)
((0, 'perl.'), 3)
((0, 'spy.'), 2)
PUT YOUR COMMENT HERE

The result shows that the clustering was not at all helpful.

Only 10 data point of class "portsweep" end up in cluster '1'. But class "portsweep" have 10413 elements classified as cluster '0'. We can obviously see that there is imbalance to classified datapoints in a class. And the misproportion does to every class in 23 labels that we have

Two clusters are plainly insufficient. In particular, for our own dataset, it's clear that there are 23 distinct classes, so it seems that k should be at least 23 , or likely, even more.

2.2.4. Choosing K

How many clusters are appropriate for a dataset? In particular, for our own dataset, it's clear that there are 23 distinct behavior patterns in the data, so it seems that k could be at least 23, or likely, even more. In other cases, we even don't have any information about the number of patterns at all (remember, generally your data is not labelled!). Our task now is finding a good value of $k$. For doing that, we have to build and evaluate models with different values of $k$. A clustering could be considered good if each data point were near to its closest centroid. One of the ways to evaluate a model is calculating the Mean of Squared Errors of all data points.

Question 6

Complete the function below to calculate the MSE of each model that is corresponding to each value of $k$. Plot the results. From the obtained result, what is the best value for $k$? Why?
In [16]:
# k: the number of clusters
def clusteringScore(data, k):
    clusters = KMeans.train(data, k, maxIterations=10, runs=10, initializationMode="random")
    # calculate mean square error
    score = data.map(lambda point: error(clusters, point)).mean()
    print(score)
    return score
In [19]:
kList = range(5,121,5)
scores = list(map(lambda xk: (xk , clusteringScore(data, xk)), kList))
/opt/spark/python/pyspark/mllib/clustering.py:347: UserWarning: The param `runs` has no effect since Spark 2.0.0.
  warnings.warn("The param `runs` has no effect since Spark 2.0.0.")
4254.42275022
2146.95298298
1594.58913877
1586.76156705
1127.27444207
1415.39753099
1083.89091088
1137.41675317
1071.05839445
1041.85747064
1055.67001132
1114.48806286
970.773383168
967.701682823
933.200337438
969.397024588
944.599647518
977.941682483
975.380941916
956.872406351
982.226180595
943.145327694
881.270158057
947.155791838
In [15]:
Scores = [[5,4254.42275022],[10,2146.95298298],[15,1594.58913877],[20,1586.76156705],[25,1127.27444207],[30,1415.39753099],[35,1083.89091088] \
          ,[40,1137.41675317],[45,1071.05839445],[50,1041.85747064],[55,1055.67001132],[60,1114.48806286],[65,970.773383168],[70,967.701682823] \
          ,[75,933.200337438],[80,969.397024588],[85,944.599647518],[90,977.941682483],[95,975.380941916],[100,956.872406351],[105,982.226180595] \
          ,[110,943.145327694],[115,881.270158057],[120,947.155791838]]
In [12]:
!pip install --upgrade pip
Collecting pip
  Downloading pip-9.0.1-py2.py3-none-any.whl (1.3MB)
    100% |████████████████████████████████| 1.3MB 672kB/s 
Installing collected packages: pip
  Found existing installation: pip 8.1.2
    Uninstalling pip-8.1.2:
      Successfully uninstalled pip-8.1.2
Successfully installed pip-9.0.1
In [13]:
!pip install plotly
Collecting plotly
  Downloading plotly-2.0.8.tar.gz (964kB)
    100% |████████████████████████████████| 972kB 792kB/s 
Requirement already satisfied: decorator in /opt/conda/lib/python3.5/site-packages (from plotly)
Requirement already satisfied: nbformat>=4.2 in /opt/conda/lib/python3.5/site-packages (from plotly)
Requirement already satisfied: pytz in /opt/conda/lib/python3.5/site-packages (from plotly)
Requirement already satisfied: requests in /opt/conda/lib/python3.5/site-packages (from plotly)
Requirement already satisfied: six in /opt/conda/lib/python3.5/site-packages (from plotly)
Building wheels for collected packages: plotly
  Running setup.py bdist_wheel for plotly ... - \ done
  Stored in directory: /root/.cache/pip/wheels/71/02/b9/818206a93e4fb056c660468645f10f09362902d63456d57535
Successfully built plotly
Installing collected packages: plotly
Successfully installed plotly-2.0.8
In [16]:
import plotly
import plotly.graph_objs as go
plotly.offline.init_notebook_mode()
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

npScores = np.array(Scores)
trace0 = go.Bar(
    x=npScores[:,0],
    y=npScores[:,1],
    marker=dict(
        color='rgb(158,202,225)',
        line=dict(
            color='rgb(8,48,107)',
            width=1.5,
        )
    ),
    opacity=0.6
)

data1 = [trace0]
layout = go.Layout(
    title='Figure 2: Mean squared error for each k',
    xaxis=dict(
        title='Number of K'
    ),
    yaxis=dict(
        title='Mean Square Error'
    )
)

fig = go.Figure(data=data1, layout=layout)
iplot(fig, filename='text-hover-bar')
PUT YOUR ANSWER HERE !

We can see that the MSE values decrease when k-values are increased.

It's obvious that as more as clusters are added, it should always be possible to make data points classified closer to a nearest centroid. In fact, if k is chosen to equal the number of data points, the average distance will be 0, because every point will be its own cluster of one.

For some k values, for example k = 55, its MSE is lower than k = 60, 65, 70. I think this shouldn't happen because the higher k values always permit at least as good a clustering as a lower k. The problem is that K-means is not necessarily able to find the optimal cluster for a given k. Its iterative process can converge from a random starting point to a local mininum, which may give us a 'good' result (I mean in term of MSE ) but not optimal (It is difficult to have the minimun MSE with random starting centroids). In this case, our iteration is just 10. Perhaps it led to a particularly supoptimal clustering, or maybe stopped early before it reached its local optimum.

For a particular k value, assuming that we have the number iteration large enough, iterate several times with random centroids and maybe decrease the value of epsilon to let the centroids continue to move longer before converge. Afterall, we choose the best performance from them to become the performance of this k-value.

We want to find the area which a k-value reduces the MSE score much then at the next k-values such that the MSE scores increase , or an area in the graph above, which is generally decreasing but eventuallly flattens out.

Here from k = 55 to the last, that is the area that we interest to find.

2.2.5 Normalizing features

K-means clustering treats equally all dimensions/directions of the space and therefore tends to produce more or less spherical (rather than elongated) clusters. In this situation, leaving variances uneven is equivalent to putting more weight on variables with smaller variance, so clusters will tend to be separated along variables with greater variance.

In our notebook, since Euclidean distance is used, the clusters will be influenced strongly by the magnitudes of the variables, especially by outliers. Normalizing will remove this bias.

Each feature can be normalized by converting it to a standard score. This means subtracting the mean of the feature’s values from each value, and dividing by the standard deviation

$normalize_i=\frac{feature_i - \mu_i}{\sigma_i}$

Where,

  • $normalize_i$ is the normalized value of feature $i$
  • $\mu_i$ is the mean of feature $i$
  • $\sigma_i$ is the standard deviation of feature $i$

Question 7

Complete the code below to normalize the data. Print the first 5 lines of the new data.

HINT
If $\sigma_i = 0$ then $normalize_i=feature_i - \mu_i$

In [17]:
def normalizeData(data):
    # number of connections
    n = data.count()
    print("n = ", n)

    # calculate the sum of each feature
    sums = data.reduce(lambda x,y: x + y)
    print("sum = ", sums)

    # calculate means
    means = sums/n
    print("means = " , means)

    # calculate the sum square of each feature
    sumSquares = data.map(lambda x: x**2).reduce(lambda x,y:x+y) 
    print("sumSquares = ", sumSquares)

    # calculate standard deviation of each feature
    stdevs = np.sqrt((sumSquares/n - means**2))
    stdevs[stdevs <= 0] = 1
    print("stdevs = ", stdevs)

    def normalize(point):
        return (point - means) / stdevs

    return data.map(lambda x : normalize(x))
In [18]:
normalizedData = normalizeData(data).cache()
print("normalized Data = ", normalizedData.take(5))
n =  4898431
sum =  [  2.36802060e+08   8.98676524e+09   5.35703589e+09   2.80000000e+01
   3.17800000e+03   3.90000000e+01   6.09250000e+04   1.57000000e+02
   7.03067000e+05   3.96200000e+04   3.34000000e+02   1.80000000e+02
   6.33610000e+04   5.82300000e+03   3.64000000e+02   5.00200000e+03
   0.00000000e+00   2.00000000e+00   4.09100000e+03   1.64084428e+09
   1.44634545e+09   8.71775140e+05   8.72101730e+05   2.82468470e+05
   2.82786920e+05   3.86919313e+06   1.03746840e+05   1.38433600e+05
   1.14124176e+09   9.26852923e+08   3.69201228e+06   1.50436230e+05
   2.96380553e+06   3.16639800e+04   8.72367200e+05   8.71361620e+05
   2.83755350e+05   2.82440660e+05]
means =  [  4.83424305e+01   1.83462118e+03   1.09362281e+03   5.71611604e-06
   6.48779170e-04   7.96173305e-06   1.24376561e-02   3.20510792e-05
   1.43529020e-01   8.08830419e-03   6.81850985e-05   3.67464602e-05
   1.29349582e-02   1.18874799e-03   7.43095085e-05   1.02114330e-03
   0.00000000e+00   4.08294003e-07   8.35165383e-04   3.34973440e+02
   2.95267086e+02   1.77970281e-01   1.78036953e-01   5.76650911e-02
   5.77301017e-02   7.89884175e-01   2.11796063e-02   2.82608043e-02
   2.32981083e+02   1.89214245e+02   7.53713236e-01   3.07111052e-02
   6.05052012e-01   6.46410657e-03   1.78091148e-01   1.77885862e-01
   5.79278038e-02   5.76594138e-02]
sumSquares =  [  2.57433563e+12   4.34145810e+18   2.03795314e+18   2.80000000e+01
   8.99800000e+03   2.55000000e+02   1.07812100e+06   2.61000000e+02
   7.03067000e+05   7.28519560e+07   3.34000000e+02   3.20000000e+02
   7.59678130e+07   7.55510000e+04   3.74000000e+02   6.18200000e+03
   0.00000000e+00   2.00000000e+00   4.09100000e+03   7.69775149e+11
   7.23474027e+11   8.69483055e+05   8.71016042e+05   2.80516842e+05
   2.81481562e+05   3.79857761e+06   3.57109172e+04   1.00690484e+05
   2.85964840e+11   2.30321984e+11   3.61091538e+06   6.23315491e+04
   2.92650429e+06   8.54361660e+03   8.69554008e+05   8.70465611e+05
   2.77693059e+05   2.77620044e+05]
stdevs =  [  7.23329737e+02   9.41430978e+05   6.45012268e+05   2.39083319e-03
   4.28543324e-02   7.21508295e-03   4.68978117e-01   7.29940683e-03
   3.50611523e-01   3.85648064e+00   8.25714534e-03   8.08243095e-03
   3.93807490e+00   1.24185736e-01   8.73758872e-03   3.55104777e-02
   1.00000000e+00   6.38978745e-04   2.88871577e-02   2.11990761e+02
   2.45992685e+02   3.81875553e-01   3.82254047e-01   2.32252900e-01
   2.32660379e-01   3.89295798e-01   8.27145751e-02   1.40559550e-01
   6.40209308e+01   1.05912757e+02   4.11185973e-01   1.08543203e-01
   4.80987669e-01   4.12597750e-02   3.81838168e-01   3.82177399e-01
   2.30942796e-01   2.30977686e-01]
normalized Data =  [array([ -6.68331854e-02,  -1.72038228e-03,   6.81884351e-02,
        -2.39084686e-03,  -1.51391734e-02,  -1.10348462e-03,
        -2.65207600e-02,  -4.39091558e-03,   2.44279187e+00,
        -2.09732783e-03,  -8.25770840e-03,  -4.54646139e-03,
        -3.28458917e-03,  -9.57233922e-03,  -8.50457842e-03,
        -2.87561127e-02,   0.00000000e+00,  -6.38979005e-04,
        -2.89113034e-02,  -1.57541507e+00,  -1.19624324e+00,
        -4.66042614e-01,  -4.65755574e-01,  -2.48285775e-01,
        -2.48130352e-01,   5.39733093e-01,  -2.56056520e-01,
        -2.01059296e-01,  -3.63913926e+00,  -1.78651044e+00,
        -1.83302273e+00,  -2.82939000e-01,  -1.25793664e+00,
        -1.56668488e-01,  -4.66404784e-01,  -4.65453641e-01,
        -2.50831829e-01,  -2.49631966e-01]), array([ -6.68331854e-02,  -1.77667956e-03,   5.32451452e-03,
        -2.39084686e-03,  -1.51391734e-02,  -1.10348462e-03,
        -2.65207600e-02,  -4.39091558e-03,   2.44279187e+00,
        -2.09732783e-03,  -8.25770840e-03,  -4.54646139e-03,
        -3.28458917e-03,  -9.57233922e-03,  -8.50457842e-03,
        -2.87561127e-02,   0.00000000e+00,  -6.38979005e-04,
        -2.89113034e-02,  -1.57069789e+00,  -1.19217808e+00,
        -4.66042614e-01,  -4.65755574e-01,  -2.48285775e-01,
        -2.48130352e-01,   5.39733093e-01,  -2.56056520e-01,
        -2.01059296e-01,  -3.62351937e+00,  -1.77706870e+00,
         5.98966843e-01,  -2.82939000e-01,   8.21118739e-01,
        -1.56668488e-01,  -4.66404784e-01,  -4.65453641e-01,
        -2.50831829e-01,  -2.49631966e-01]), array([ -6.68331854e-02,  -1.69807581e-03,   2.08332760e-04,
        -2.39084686e-03,  -1.51391734e-02,  -1.10348462e-03,
        -2.65207600e-02,  -4.39091558e-03,   2.44279187e+00,
        -2.09732783e-03,  -8.25770840e-03,  -4.54646139e-03,
        -3.28458917e-03,  -9.57233922e-03,  -8.50457842e-03,
        -2.87561127e-02,   0.00000000e+00,  -6.38979005e-04,
        -2.89113034e-02,  -1.57541507e+00,  -1.19624324e+00,
        -4.66042614e-01,  -4.65755574e-01,  -2.48285775e-01,
        -2.48130352e-01,   5.39733093e-01,  -2.56056520e-01,
        -2.01059296e-01,  -3.60789948e+00,  -1.76762697e+00,
         5.98966843e-01,  -2.82939000e-01,  -2.18408950e-01,
        -1.56668488e-01,  -4.66404784e-01,  -4.65453641e-01,
        -2.50831829e-01,  -2.49631966e-01]), array([ -6.68331854e-02,  -1.70126245e-03,   1.45482068e-03,
        -2.39084686e-03,  -1.51391734e-02,  -1.10348462e-03,
        -2.65207600e-02,  -4.39091558e-03,   2.44279187e+00,
        -2.09732783e-03,  -8.25770840e-03,  -4.54646139e-03,
        -3.28458917e-03,  -9.57233922e-03,  -8.50457842e-03,
        -2.87561127e-02,   0.00000000e+00,  -6.38979005e-04,
        -2.89113034e-02,  -1.57069789e+00,  -1.19217808e+00,
        -4.66042614e-01,  -4.65755574e-01,  -2.48285775e-01,
        -2.48130352e-01,   5.39733093e-01,  -2.56056520e-01,
        -2.01059296e-01,  -3.59227958e+00,  -1.75818524e+00,
         5.98966843e-01,  -2.82939000e-01,  -5.71848364e-01,
        -1.56668488e-01,  -4.66404784e-01,  -4.65453641e-01,
        -2.50831829e-01,  -2.49631966e-01]), array([ -6.68331854e-02,  -1.69488918e-03,  -9.42032957e-04,
        -2.39084686e-03,  -1.51391734e-02,  -1.10348462e-03,
        -2.65207600e-02,  -4.39091558e-03,   2.44279187e+00,
        -2.09732783e-03,  -8.25770840e-03,  -4.54646139e-03,
        -3.28458917e-03,  -9.57233922e-03,  -8.50457842e-03,
        -2.87561127e-02,   0.00000000e+00,  -6.38979005e-04,
        -2.89113034e-02,  -1.56598070e+00,  -1.18811292e+00,
        -4.66042614e-01,  -4.65755574e-01,  -2.48285775e-01,
        -2.48130352e-01,   5.39733093e-01,  -2.56056520e-01,
        -2.01059296e-01,  -3.57665969e+00,  -1.74874350e+00,
         5.98966843e-01,  -2.82939000e-01,  -7.38172794e-01,
        -1.56668488e-01,  -4.66404784e-01,  -4.65453641e-01,
        -2.50831829e-01,  -2.49631966e-01])]

Question 8

Using the new data, build different models with different values of $k \in [60,70,80,90,100,110]$. Evaluate the results by plotting them and choose the best value of $k$.
In [25]:
kList = range(60,111,10)
scores = list(map(lambda xk: (xk , clusteringScore(normalizedData, xk)), kList))
/opt/spark/python/pyspark/mllib/clustering.py:347: UserWarning:

The param `runs` has no effect since Spark 2.0.0.

0.474390911372
0.439719518963
0.438664447222
0.423802136193
0.454752147422
0.420780443995
In [17]:
Scores = [[60,0.474390911372],[70,0.439719518963],[80,0.438664447222],[90,0.423802136193],[100,0.454752147422],[110,0.420780443995]]
In [18]:
npScores = np.array(Scores)
trace0 = go.Bar(
    x=npScores[:,0],
    y=npScores[:,1],
    marker=dict(
        color='rgb(158,202,225)',
        line=dict(
            color='rgb(8,48,107)',
            width=1.5,
        )
    ),
    opacity=0.6
)

data2 = [trace0]
layout = go.Layout(
    title='Figure 2: Mean squared error for each k after normalization',
    xaxis=dict(
        title='Number of K'
    ),
    yaxis=dict(
        title='Mean Square Error'
    )
)

fig = go.Figure(data=data2, layout=layout)
iplot(fig, filename='text-hover-bar')
PUT YOUR ANSWER HERE !!!

Comment:

in the certain interval of k-values in both figures (Fig 1 and 2), they follow the same pattern for MSE score

As mentioned before, we can see that our interest area occurs at the value k = 80 (while in fig 1 is k = 55)

The best K-value after normalizing data is 110 comparing to k-value = 115 in fig 1. In my opinion, the way which we normalize data is just somehow increase the perfromance slightly.

Question 9

Plot the clustering result to see the difference between before and after normalizing features. Discuss about the difference and explain why and if normalization was useful.
In [19]:
clusters = KMeans.train(data,110, maxIterations=10, runs=10, initializationMode="random")
/opt/spark/python/pyspark/mllib/clustering.py:347: UserWarning:

The param `runs` has no effect since Spark 2.0.0.

Plotting real data

In [62]:
PlotData(data, 0.3, clusters)
(1471189, 38)
/opt/conda/lib/python3.5/site-packages/matplotlib/collections.py:590: FutureWarning:

elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison

Plotting normalized data

In [20]:
clusters = KMeans.train(normalizedData, 110, maxIterations=10, runs=10, initializationMode="random")
PlotData(normalizedData, 0.3, clusters)
/opt/spark/python/pyspark/mllib/clustering.py:347: UserWarning:

The param `runs` has no effect since Spark 2.0.0.

(1470179, 38)
/opt/conda/lib/python3.5/site-packages/matplotlib/collections.py:590: FutureWarning:

elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison

PUT YOUR ANSWER HERE !!!

We can see normalized data points reveals a richer structure, as expected. The dominant feature of the visualization is more robust. We don't see many points fall on top of one another ( lying on the straight line) like before normalization.

With 110 clusters, it's hard to make out which points come from which clusters (because of the limitation of visualized dimensions)

2.3. Clustering using categorical features

2.3.1 Loading data

In the previous section, we ignored the categorical features of our data: this is not a good idea, since these categorical features can be important in providing useful information for clustering. The problem is that K-means (or at least, the one we have developed and the one we use from MLLib) only work with data points in a metric space. Informally, this means that operations such as addition, subtraction and computing the mean of data points are trivial and well defined. For a more formal definition of what a metric space is, follow this link.

What we will do next is to transform each categorical feature into one or more numerical features. This approach is very widespread: imagine for example you wanted to use K-means to cluster text data. Then, the idea is to transform text data in $d$-dimensional vectors, and a nice way to do it is to use word2vec. If you're interested, follow this link to a nice blog post on the problem.

There are two approaches:

  • Approach 1: mapping one categorical feature to one numerical feature. The values in each categorical feature are encoded into unique numbers of the new numerical feature. For example, ['VERY HOT','HOT', 'COOL', 'COLD', 'VERY COLD'] will be encoded into [0,1,2,3,4,5]. However, by using this method, we implicit assume that the value of 'VERY HOT' is smaller than 'HOT'... This is not generally true.

  • Approach 2: mapping one categorical feature to multiple numerical features. Basically, a single variable with $n$ observations and $d$ distinct values, to $d$ binary variables with $n$ observations each. Each observation indicating the presence (1) or absence (0) of the $d^{th}$ binary variable. For example, ['house', 'car', 'tooth', 'car'] becomes

    [
    [1,0,0,0],
    [0,1,0,0],
    [0,0,1,0],
    [0,0,0,1],
    ]

We call the second approach "one-hot encoding". By using this approach, we keep the same role for all values of categorical features.

Question 10

Calculate the number of distinct categorical features value (at index `1,2,3`). Then construct a new input data using one-hot encoding for these categorical features (don't throw away numerical features!).
In [21]:
def testparseLine(line, k):
    cols = line.split(",")
    # label is the last column
    return cols[k]

cols1 = np.array(raw_data.map(lambda x: testparseLine(x, 1)).distinct().collect())
cols2 = np.array(raw_data.map(lambda x: testparseLine(x, 2)).distinct().collect())
cols3 = np.array(raw_data.map(lambda x: testparseLine(x, 3)).distinct().collect())
print(cols1)
print(cols2)
print(cols3)
['udp' 'tcp' 'icmp']
['whois' 'pop_3' 'harvest' 'uucp_path' 'systat' 'ecr_i' 'other' 'tftp_u'
 'vmnet' 'time' 'remote_job' 'http' 'Z39_50' 'ftp' 'netbios_dgm' 'kshell'
 'name' 'ldap' 'domain_u' 'discard' 'printer' 'eco_i' 'sunrpc' 'http_2784'
 'sql_net' 'private' 'nnsp' 'finger' 'daytime' 'pop_2' 'exec' 'urp_i'
 'iso_tsap' 'IRC' 'ftp_data' 'netbios_ns' 'http_443' 'urh_i' 'bgp' 'ctf'
 'hostnames' 'domain' 'klogin' 'netstat' 'ssh' 'nntp' 'login' 'imap4' 'aol'
 'mtp' 'tim_i' 'rje' 'shell' 'csnet_ns' 'efs' 'smtp' 'X11' 'telnet'
 'pm_dump' 'supdup' 'red_i' 'http_8001' 'courier' 'echo' 'link' 'uucp'
 'netbios_ssn' 'gopher' 'auth' 'ntp_u']
['SH' 'OTH' 'S1' 'S2' 'REJ' 'RSTOS0' 'RSTO' 'S0' 'RSTR' 'S3' 'SF']
In [22]:
def parseLineWithHotEncoding(line):
    cols = line.split(",")
    # label is the last column
    label = cols[-1]
    
    vector = cols[0:-1]
    
    # the binary features that are encoded from the first categorial feature
    featureOfCol1 = np.array([0] * len(cols1))
    featureOfCol1[cols1 == cols[1]] = 1
    # the binary features that are encoded from the second categorial feature
    featureOfCol2 = np.array([0] * len(cols2))
    featureOfCol2[cols2 == cols[2]] = 1
    # the binary features that are encoded from the third categorial feature
    featureOfCol3 = np.array([0] * len(cols3))
    featureOfCol3[cols3 == cols[3]] = 1
    
    # construct the new vector
    #vector = ([np.array(vector[0]) + featureOfCol1 + featureOfCol2 + featureOfCol3 + np.array(vector[4:])])
    vector = np.hstack((vector[0],  featureOfCol1,  featureOfCol2,  featureOfCol3,  vector[4:]))
    
    # convert each value from string to float
    vector = np.array(list(map(lambda x: float(x), vector)))
    
    return (label, vector)
In [23]:
labelsAndData = raw_data.map(parseLineWithHotEncoding)

# we only need the data, not the label
data = labelsAndData.values().cache()
print(data.first())

normalizedData = normalizeData(data).cache()
print("normalized Data = ", normalizedData.take(1))
[  0.00000000e+00   0.00000000e+00   1.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   1.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   1.00000000e+00   2.15000000e+02   4.50760000e+04   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   1.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   1.00000000e+00
   1.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   1.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00]
n =  4898431
sum =  [  2.36802060e+08   1.94288000e+05   1.87059800e+06   2.83354500e+06
   1.07300000e+03   1.98100000e+03   2.00000000e+00   1.05700000e+03
   1.05600000e+03   2.81166000e+06   7.26530000e+04   3.00000000e+00
   1.05300000e+03   1.57900000e+03   1.07300000e+03   6.23091000e+05
   1.07800000e+03   5.21400000e+03   1.05200000e+03   1.04000000e+03
   1.06700000e+03   1.04100000e+03   5.77820000e+04   1.05900000e+03
   1.04500000e+03   1.63380000e+04   1.05600000e+03   1.00000000e+00
   1.05200000e+03   1.10083100e+06   1.03800000e+03   6.89100000e+03
   1.05600000e+03   1.05500000e+03   1.04500000e+03   5.37800000e+03
   1.05200000e+03   5.21000000e+02   4.06970000e+04   1.05400000e+03
   1.04400000e+03   1.48000000e+02   1.04700000e+03   1.06800000e+03
   1.05000000e+03   1.11300000e+03   1.05000000e+03   1.05600000e+03
   1.07500000e+03   1.05900000e+03   1.04500000e+03   1.06900000e+03
   2.00000000e+00   1.07600000e+03   1.20000000e+01   1.07000000e+03
   1.05100000e+03   1.05100000e+03   1.04200000e+03   9.65540000e+04
   1.35000000e+02   4.27700000e+03   5.00000000e+00   1.06000000e+03
   9.00000000e+00   2.00000000e+00   1.02100000e+03   1.05900000e+03
   1.06900000e+03   1.04100000e+03   1.05500000e+03   1.07700000e+03
   3.38200000e+03   3.83300000e+03   1.04000000e+03   5.70000000e+01
   5.32000000e+02   1.61000000e+02   2.68874000e+05   1.22000000e+02
   5.34400000e+03   8.69829000e+05   8.09400000e+03   5.00000000e+01
   3.74432800e+06   8.98676524e+09   5.35703589e+09   2.80000000e+01
   3.17800000e+03   3.90000000e+01   6.09250000e+04   1.57000000e+02
   7.03067000e+05   3.96200000e+04   3.34000000e+02   1.80000000e+02
   6.33610000e+04   5.82300000e+03   3.64000000e+02   5.00200000e+03
   0.00000000e+00   2.00000000e+00   4.09100000e+03   1.64084428e+09
   1.44634545e+09   8.71775140e+05   8.72101730e+05   2.82468470e+05
   2.82786920e+05   3.86919313e+06   1.03746840e+05   1.38433600e+05
   1.14124176e+09   9.26852923e+08   3.69201228e+06   1.50436230e+05
   2.96380553e+06   3.16639800e+04   8.72367200e+05   8.71361620e+05
   2.83755350e+05   2.82440660e+05]
means =  [  4.83424305e+01   3.96633126e-02   3.81876972e-01   5.78459715e-01
   2.19049732e-04   4.04415210e-04   4.08294003e-07   2.15783380e-04
   2.15579233e-04   5.73991958e-01   1.48318921e-02   6.12441004e-07
   2.14966792e-04   3.22348115e-04   2.19049732e-04   1.27202159e-01
   2.20070467e-04   1.06442247e-03   2.14762645e-04   2.12312881e-04
   2.17824850e-04   2.12517028e-04   1.17960220e-02   2.16191674e-04
   2.13333616e-04   3.33535371e-03   2.15579233e-04   2.04147001e-07
   2.14762645e-04   2.24731348e-01   2.11904587e-04   1.40677699e-03
   2.15579233e-04   2.15375086e-04   2.13333616e-04   1.09790257e-03
   2.14762645e-04   1.06360588e-04   8.30817051e-03   2.15170939e-04
   2.13129469e-04   3.02137562e-05   2.13741910e-04   2.18028997e-04
   2.14354351e-04   2.27215613e-04   2.14354351e-04   2.15579233e-04
   2.19458026e-04   2.16191674e-04   2.13333616e-04   2.18233144e-04
   4.08294003e-07   2.19662173e-04   2.44976402e-06   2.18437291e-04
   2.14558498e-04   2.14558498e-04   2.12721175e-04   1.97112096e-02
   2.75598452e-05   8.73136725e-04   1.02073501e-06   2.16395821e-04
   1.83732301e-06   4.08294003e-07   2.08434088e-04   2.16191674e-04
   2.18233144e-04   2.12517028e-04   2.15375086e-04   2.19866320e-04
   6.90425159e-04   7.82495456e-04   2.12312881e-04   1.16363791e-05
   1.08606205e-04   3.28676672e-05   5.48898208e-02   2.49059342e-05
   1.09096158e-03   1.77572982e-01   1.65236583e-03   1.02073501e-05
   7.64393333e-01   1.83462118e+03   1.09362281e+03   5.71611604e-06
   6.48779170e-04   7.96173305e-06   1.24376561e-02   3.20510792e-05
   1.43529020e-01   8.08830419e-03   6.81850985e-05   3.67464602e-05
   1.29349582e-02   1.18874799e-03   7.43095085e-05   1.02114330e-03
   0.00000000e+00   4.08294003e-07   8.35165383e-04   3.34973440e+02
   2.95267086e+02   1.77970281e-01   1.78036953e-01   5.76650911e-02
   5.77301017e-02   7.89884175e-01   2.11796063e-02   2.82608043e-02
   2.32981083e+02   1.89214245e+02   7.53713236e-01   3.07111052e-02
   6.05052012e-01   6.46410657e-03   1.78091148e-01   1.77885862e-01
   5.79278038e-02   5.76594138e-02]
sumSquares =  [  2.57433563e+12   1.94288000e+05   1.87059800e+06   2.83354500e+06
   1.07300000e+03   1.98100000e+03   2.00000000e+00   1.05700000e+03
   1.05600000e+03   2.81166000e+06   7.26530000e+04   3.00000000e+00
   1.05300000e+03   1.57900000e+03   1.07300000e+03   6.23091000e+05
   1.07800000e+03   5.21400000e+03   1.05200000e+03   1.04000000e+03
   1.06700000e+03   1.04100000e+03   5.77820000e+04   1.05900000e+03
   1.04500000e+03   1.63380000e+04   1.05600000e+03   1.00000000e+00
   1.05200000e+03   1.10083100e+06   1.03800000e+03   6.89100000e+03
   1.05600000e+03   1.05500000e+03   1.04500000e+03   5.37800000e+03
   1.05200000e+03   5.21000000e+02   4.06970000e+04   1.05400000e+03
   1.04400000e+03   1.48000000e+02   1.04700000e+03   1.06800000e+03
   1.05000000e+03   1.11300000e+03   1.05000000e+03   1.05600000e+03
   1.07500000e+03   1.05900000e+03   1.04500000e+03   1.06900000e+03
   2.00000000e+00   1.07600000e+03   1.20000000e+01   1.07000000e+03
   1.05100000e+03   1.05100000e+03   1.04200000e+03   9.65540000e+04
   1.35000000e+02   4.27700000e+03   5.00000000e+00   1.06000000e+03
   9.00000000e+00   2.00000000e+00   1.02100000e+03   1.05900000e+03
   1.06900000e+03   1.04100000e+03   1.05500000e+03   1.07700000e+03
   3.38200000e+03   3.83300000e+03   1.04000000e+03   5.70000000e+01
   5.32000000e+02   1.61000000e+02   2.68874000e+05   1.22000000e+02
   5.34400000e+03   8.69829000e+05   8.09400000e+03   5.00000000e+01
   3.74432800e+06   4.34145810e+18   2.03795314e+18   2.80000000e+01
   8.99800000e+03   2.55000000e+02   1.07812100e+06   2.61000000e+02
   7.03067000e+05   7.28519560e+07   3.34000000e+02   3.20000000e+02
   7.59678130e+07   7.55510000e+04   3.74000000e+02   6.18200000e+03
   0.00000000e+00   2.00000000e+00   4.09100000e+03   7.69775149e+11
   7.23474027e+11   8.69483055e+05   8.71016042e+05   2.80516842e+05
   2.81481562e+05   3.79857761e+06   3.57109172e+04   1.00690484e+05
   2.85964840e+11   2.30321984e+11   3.61091538e+06   6.23315491e+04
   2.92650429e+06   8.54361660e+03   8.69554008e+05   8.70465611e+05
   2.77693059e+05   2.77620044e+05]
stdevs =  [  7.23329737e+02   1.95166939e-01   4.85846633e-01   4.93805704e-01
   1.47987077e-02   2.01060105e-02   6.38978745e-04   1.46879821e-02
   1.46810340e-02   4.94494884e-01   1.20879721e-01   7.82585860e-04
   1.46601699e-02   1.79511617e-02   1.47987077e-02   3.33199295e-01
   1.48331398e-02   3.26081197e-02   1.46532086e-02   1.45694133e-02
   1.47572830e-02   1.45764147e-02   1.07967013e-01   1.47018684e-02
   1.46043865e-02   5.76561282e-02   1.46810340e-02   4.51826249e-04
   1.46532086e-02   4.17405282e-01   1.45554005e-02   3.74806345e-02
   1.46810340e-02   1.46740826e-02   1.46043865e-02   3.31164186e-02
   1.46532086e-02   1.03125785e-02   9.07697351e-02   1.46671279e-02
   1.45973986e-02   5.49662108e-03   1.46183523e-02   1.47641952e-02
   1.46392761e-02   1.50719602e-02   1.46392761e-02   1.46810340e-02
   1.48124902e-02   1.47018684e-02   1.46043865e-02   1.47711042e-02
   6.38978745e-04   1.48193766e-02   1.56517028e-03   1.47780099e-02
   1.46462440e-02   1.46462440e-02   1.45834127e-02   1.39006035e-01
   5.24967481e-03   2.95359841e-02   1.01031380e-03   1.47088067e-02
   1.35547764e-03   6.38978745e-04   1.44357419e-02   1.47018684e-02
   1.47711042e-02   1.45764147e-02   1.46740826e-02   1.48262598e-02
   2.62668702e-02   2.79621737e-02   1.45694133e-02   3.41119388e-03
   1.04208641e-02   5.73293877e-03   2.27765073e-01   4.99052240e-03
   3.30116855e-02   3.82152873e-01   4.06157053e-02   3.19487807e-03
   4.24377385e-01   9.41430978e+05   6.45012268e+05   2.39083319e-03
   4.28543324e-02   7.21508295e-03   4.68978117e-01   7.29940683e-03
   3.50611523e-01   3.85648064e+00   8.25714534e-03   8.08243095e-03
   3.93807490e+00   1.24185736e-01   8.73758872e-03   3.55104777e-02
   1.00000000e+00   6.38978745e-04   2.88871577e-02   2.11990761e+02
   2.45992685e+02   3.81875553e-01   3.82254047e-01   2.32252900e-01
   2.32660379e-01   3.89295798e-01   8.27145751e-02   1.40559550e-01
   6.40209308e+01   1.05912757e+02   4.11185973e-01   1.08543203e-01
   4.80987669e-01   4.12597750e-02   3.81838168e-01   3.82177399e-01
   2.30942796e-01   2.30977686e-01]
normalized Data =  [array([ -6.68331854e-02,  -2.03227620e-01,   1.27225957e+00,
        -1.17143182e+00,  -1.48019501e-02,  -2.01141450e-02,
        -6.38979005e-04,  -1.46911522e-02,  -1.46841996e-02,
        -1.16076420e+00,  -1.22699589e-01,  -7.82586340e-04,
        -1.46633220e-02,  -1.79569501e-02,  -1.48019501e-02,
         2.61944684e+00,  -1.48364049e-02,  -3.26428655e-02,
        -1.46563562e-02,  -1.45725073e-02,  -1.47604982e-02,
        -1.45795131e-02,  -1.09255797e-01,  -1.47050475e-02,
        -1.46075028e-02,  -5.78490754e-02,  -1.46841996e-02,
        -4.51826342e-04,  -1.46563562e-02,  -5.38400824e-01,
        -1.45584855e-02,  -3.75334357e-02,  -1.46841996e-02,
        -1.46772437e-02,  -1.46075028e-02,  -3.31528172e-02,
        -1.46563562e-02,  -1.03136755e-02,  -9.15301835e-02,
        -1.46702845e-02,  -1.46005104e-02,  -5.49678716e-03,
        -1.46214776e-02,  -1.47674150e-02,  -1.46424147e-02,
        -1.50753856e-02,  -1.46424147e-02,  -1.46841996e-02,
        -1.48157416e-02,  -1.47050475e-02,  -1.46075028e-02,
        -1.47743284e-02,  -6.38979005e-04,  -1.48226325e-02,
        -1.56517412e-03,  -1.47812387e-02,  -1.46493871e-02,
        -1.46493871e-02,  -1.45865155e-02,  -1.41801106e-01,
        -5.24981950e-03,  -2.95617956e-02,  -1.01031483e-03,
        -1.47119903e-02,  -1.35548013e-03,  -6.38979005e-04,
        -1.44387514e-02,  -1.47050475e-02,  -1.47743284e-02,
        -1.45795131e-02,  -1.46772437e-02,  -1.48295203e-02,
        -2.62850181e-02,  -2.79840711e-02,  -1.45725073e-02,
        -3.41123357e-03,  -1.04219960e-02,  -5.73312720e-03,
        -2.40993143e-01,  -4.99064670e-03,  -3.30477393e-02,
        -4.64664784e-01,  -4.06829284e-02,  -3.19491068e-03,
         5.55181955e-01,  -1.72038228e-03,   6.81884351e-02,
        -2.39084686e-03,  -1.51391734e-02,  -1.10348462e-03,
        -2.65207600e-02,  -4.39091558e-03,   2.44279187e+00,
        -2.09732783e-03,  -8.25770840e-03,  -4.54646139e-03,
        -3.28458917e-03,  -9.57233922e-03,  -8.50457842e-03,
        -2.87561127e-02,   0.00000000e+00,  -6.38979005e-04,
        -2.89113034e-02,  -1.57541507e+00,  -1.19624324e+00,
        -4.66042614e-01,  -4.65755574e-01,  -2.48285775e-01,
        -2.48130352e-01,   5.39733093e-01,  -2.56056520e-01,
        -2.01059296e-01,  -3.63913926e+00,  -1.78651044e+00,
        -1.83302273e+00,  -2.82939000e-01,  -1.25793664e+00,
        -1.56668488e-01,  -4.66404784e-01,  -4.65453641e-01,
        -2.50831829e-01,  -2.49631966e-01])]

2.3.2. Building models

Question 11

Using the new data, cluster the connections with different values of $k \in [80,90,100,110,120,130,140,150,160]$. Evaluate the results and choose the best value of $k$ as previous questions.
In [32]:
kList = range(80,161,10)
scores = list(map(lambda xk: (xk , clusteringScore(normalizedData, xk)), kList))
/opt/spark/python/pyspark/mllib/clustering.py:347: UserWarning:

The param `runs` has no effect since Spark 2.0.0.

1.43921545452
1.31055972898
1.21564755766
1.24053003769
1.1027274355
1.21412836334
1.09840907092
1.13197126391
1.03432902661
In [19]:
Scores = [[80,1.43921545452],[90,1.31055972898],[100,1.21564755766],[110,1.24053003769],[120,1.1027274355],[130,1.21412836334]\
          ,[140,1.09840907092],[150,1.13197126391],[160,1.03432902661]]
In [20]:
npScores = np.array(Scores)
trace0 = go.Bar(
    x=npScores[:,0],
    y=npScores[:,1],
    marker=dict(
        color='rgb(158,202,225)',
        line=dict(
            color='rgb(8,48,107)',
            width=1.5,
        )
    ),
    opacity=0.6
)

data3 = [trace0]
layout = go.Layout(
    title='Figure 3: Mean squared error for each k after normalization and  categorical features',
    xaxis=dict(
        title='Number of K'
    ),
    yaxis=dict(
        title='Mean Square Error'
    )
)

fig = go.Figure(data=data3, layout=layout)
iplot(fig, filename='text-hover-bar')
PUT YOUR ANSWER HERE !!!

After normalizing data, we can see the little change in number of k compared to previous results. The graph shows that in this method, the best k-value is 120 when at the fig 1 and fig 2 the k-values are 115 and 110 respectively

2.4. Anomaly detection

When we have a new connection data (e.g., one that we never saw before), we simply find the closest cluster for it, and use this information as a proxy to indicate whether the data point is anomalous or not. A simple approach to decide when there is an anomaly or not, amounts to measuring the new data point’s distance to its nearest centroid. If this distance exceeds some thresholds, it is anomalous.

Question 12

Build your model with the best value of $k$ in your opinion. Then, detect the anomalous connections in our data. Plot and discuss your result.

HINT
The threshold has strong impact on the result. Be careful when choosing it! A simple way to choose the threshold's value is picking up a distance of a data point from among known data. For example, the 100th-farthest data point distance can be an option.

In [24]:
clusters = KMeans.train(normalizedData, 160, maxIterations=10, runs=10, initializationMode="random")
# calculate mean square error
score = normalizedData.map(lambda point: error(clusters, point))
anoThreshold = score.top(100)[-1]
print(anoThreshold)
print(score.top(1)[0])
/opt/spark/python/pyspark/mllib/clustering.py:347: UserWarning:

The param `runs` has no effect since Spark 2.0.0.

341.511978398
2680.57585836

We build our model with k = 160

We can see that the largest distance from a point to its closest center is: 2680.57585836

We choose the distance from 100th-farthest data point to its closest center to become our threashold. In this case threshold = 341.511978398

If the distance of a new data point to its closest center bigger than the threshold, it is anomalous and need to trigger an alert and update our database.

In [25]:
#We can calculate again means and stdevs for use in our function that map from raw data  to error for ranking and filtering

n = data.count()
#print("n = ", n)

# calculate the sum of each feature
sums = data.reduce(lambda x,y: x + y)
#print("sum = ", sums)

# calculate means
means = sums/n
#print("means = " , means)

# calculate the sum square of each feature
sumSquares = data.map(lambda x: x**2).reduce(lambda x,y:x+y) 
#print("sumSquares = ", sumSquares)

# calculate standard deviation of each feature
stdevs = np.sqrt((sumSquares/n - means**2))
stdevs[stdevs <= 0] = 1
#print("stdevs = ", stdevs)
In [26]:
#We can to map from each raw data line to its error for ranking and filtering

def errorFromRawData(clusters, line):
    cols = line.split(",")
    # label is the last column
    label = cols[-1]
    
    vector = cols[0:-1]
    
    # the binary features that are encoded from the first categorial feature
    featureOfCol1 = np.array([0] * len(cols1))
    featureOfCol1[cols1 == cols[1]] = 1
    # the binary features that are encoded from the second categorial feature
    featureOfCol2 = np.array([0] * len(cols2))
    featureOfCol2[cols2 == cols[2]] = 1
    # the binary features that are encoded from the third categorial feature
    featureOfCol3 = np.array([0] * len(cols3))
    featureOfCol3[cols3 == cols[3]] = 1
    
    # construct the new vector
    vector = np.hstack((vector[0],  featureOfCol1,  featureOfCol2,  featureOfCol3,  vector[4:]))
    
    # convert each value from string to float
    vector = np.array(list(map(lambda x: float(x), vector)))
    
    normalizedvector = (vector - means) / stdevs
    closest_center = clusters.centers[clusters.predict(normalizedvector)]
    return (euclidean_distance(normalizedvector, closest_center), line)
In [27]:
raw_data_error = raw_data.map(lambda line: errorFromRawData(clusters, line))
filteredData = raw_data_error.filter(lambda line: line[0] >= anoThreshold)
print(filteredData.count())
100

Because we used the distance from 100th-farthest data point to its closest center to become our threashold. And we also use this threshold to filter own old data set. So finally, we get the filtered dataset with 100 anomalous data point

We try to find the most anomalous data point in our data set.

In [28]:
#The most anomalous data point 
top1 = filteredData.top(1, key = lambda x: x[0])
print(top1)
[(2680.575858363381, '5328,tcp,telnet,SF,1604,920608,0,0,0,2,0,1,7479,0,0,7468,2,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,20,8,0.40,0.10,0.05,0.00,0.00,0.00,0.00,0.00,normal.')]

A network security expert would be more able to interpret why this is or is not actually a strange connection. It appears unusual at least because it is labeled "normal", but far away from other "normal" data points.

We try to plot these top 100 anomalous data points

In [29]:
plotDatas = filteredData.map(lambda x: x[1]).cache()
labelsAndData2 = plotDatas.map(parseLineWithHotEncoding).cache()
data2 = labelsAndData2.values().cache()
normalizedData2 = normalizeData(data2).cache()
n =  100
sum =  [  4.28169000e+05   3.00000000e+00   7.60000000e+01   2.10000000e+01
   0.00000000e+00   0.00000000e+00   2.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   1.00000000e+00   3.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   2.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   1.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   1.00000000e+00
   0.00000000e+00   3.00000000e+00   0.00000000e+00   2.80000000e+01
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   2.00000000e+00   0.00000000e+00   1.20000000e+01   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   2.90000000e+01   5.00000000e+00   0.00000000e+00
   9.00000000e+00   2.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   2.00000000e+00   0.00000000e+00   5.00000000e+00   4.00000000e+00
   3.00000000e+00   3.00000000e+01   3.00000000e+00   0.00000000e+00
   5.30000000e+01   4.24417948e+09   2.12096702e+09   2.80000000e+01
   0.00000000e+00   2.50000000e+01   1.10000000e+01   3.60000000e+01
   2.30000000e+01   1.74400000e+04   1.50000000e+01   2.80000000e+01
   1.83710000e+04   6.10000000e+01   0.00000000e+00   6.90000000e+01
   0.00000000e+00   2.00000000e+00   1.00000000e+00   4.61100000e+03
   1.69000000e+02   3.09000000e+01   3.26700000e+01   1.85300000e+01
   1.43300000e+01   8.15000000e+01   1.60000000e+01   2.30000000e+01
   8.92200000e+03   5.80000000e+02   4.10300000e+01   1.64300000e+01
   4.09600000e+01   1.95000000e+01   2.97500000e+01   2.43500000e+01
   1.34300000e+01   1.27700000e+01]
means =  [  4.28169000e+03   3.00000000e-02   7.60000000e-01   2.10000000e-01
   0.00000000e+00   0.00000000e+00   2.00000000e-02   0.00000000e+00
   0.00000000e+00   0.00000000e+00   1.00000000e-02   3.00000000e-02
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   2.00000000e-02   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   1.00000000e-02
   0.00000000e+00   0.00000000e+00   0.00000000e+00   1.00000000e-02
   0.00000000e+00   3.00000000e-02   0.00000000e+00   2.80000000e-01
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   2.00000000e-02   0.00000000e+00   1.20000000e-01   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   2.90000000e-01   5.00000000e-02   0.00000000e+00
   9.00000000e-02   2.00000000e-02   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   2.00000000e-02   0.00000000e+00   5.00000000e-02   4.00000000e-02
   3.00000000e-02   3.00000000e-01   3.00000000e-02   0.00000000e+00
   5.30000000e-01   4.24417948e+07   2.12096701e+07   2.80000000e-01
   0.00000000e+00   2.50000000e-01   1.10000000e-01   3.60000000e-01
   2.30000000e-01   1.74400000e+02   1.50000000e-01   2.80000000e-01
   1.83710000e+02   6.10000000e-01   0.00000000e+00   6.90000000e-01
   0.00000000e+00   2.00000000e-02   1.00000000e-02   4.61100000e+01
   1.69000000e+00   3.09000000e-01   3.26700000e-01   1.85300000e-01
   1.43300000e-01   8.15000000e-01   1.60000000e-01   2.30000000e-01
   8.92200000e+01   5.80000000e+00   4.10300000e-01   1.64300000e-01
   4.09600000e-01   1.95000000e-01   2.97500000e-01   2.43500000e-01
   1.34300000e-01   1.27700000e-01]
sumSquares =  [  1.10784348e+10   3.00000000e+00   7.60000000e+01   2.10000000e+01
   0.00000000e+00   0.00000000e+00   2.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   1.00000000e+00   3.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   2.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   1.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   1.00000000e+00
   0.00000000e+00   3.00000000e+00   0.00000000e+00   2.80000000e+01
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   2.00000000e+00   0.00000000e+00   1.20000000e+01   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   2.90000000e+01   5.00000000e+00   0.00000000e+00
   9.00000000e+00   2.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   2.00000000e+00   0.00000000e+00   5.00000000e+00   4.00000000e+00
   3.00000000e+00   3.00000000e+01   3.00000000e+00   0.00000000e+00
   5.30000000e+01   4.28022132e+18   2.03642599e+18   2.80000000e+01
   0.00000000e+00   2.35000000e+02   2.50000000e+01   1.26000000e+02
   2.30000000e+01   6.55127100e+07   1.50000000e+01   5.60000000e+01
   6.67251890e+07   1.94700000e+03   0.00000000e+00   4.69000000e+02
   0.00000000e+00   2.00000000e+00   1.00000000e+00   2.11068900e+06
   4.79000000e+02   2.93188000e+01   3.24489000e+01   1.69483000e+01
   1.41089000e+01   7.99474000e+01   1.52348000e+01   2.30000000e+01
   1.74177000e+06   1.14640000e+04   3.74153000e+01   1.14365000e+01
   3.86644000e+01   1.70008000e+01   2.72341000e+01   2.10143000e+01
   1.01411000e+01   1.15381000e+01]
stdevs =  [  9.61516921e+03   1.70587221e-01   4.27083130e-01   4.07308237e-01
   1.00000000e+00   1.00000000e+00   1.40000000e-01   1.00000000e+00
   1.00000000e+00   1.00000000e+00   9.94987437e-02   1.70587221e-01
   1.00000000e+00   1.00000000e+00   1.00000000e+00   1.00000000e+00
   1.00000000e+00   1.40000000e-01   1.00000000e+00   1.00000000e+00
   1.00000000e+00   1.00000000e+00   1.00000000e+00   9.94987437e-02
   1.00000000e+00   1.00000000e+00   1.00000000e+00   9.94987437e-02
   1.00000000e+00   1.70587221e-01   1.00000000e+00   4.48998886e-01
   1.00000000e+00   1.00000000e+00   1.00000000e+00   1.00000000e+00
   1.00000000e+00   1.00000000e+00   1.00000000e+00   1.00000000e+00
   1.00000000e+00   1.00000000e+00   1.00000000e+00   1.00000000e+00
   1.00000000e+00   1.00000000e+00   1.00000000e+00   1.00000000e+00
   1.00000000e+00   1.00000000e+00   1.00000000e+00   1.00000000e+00
   1.40000000e-01   1.00000000e+00   3.24961536e-01   1.00000000e+00
   1.00000000e+00   1.00000000e+00   1.00000000e+00   1.00000000e+00
   1.00000000e+00   4.53762052e-01   2.17944947e-01   1.00000000e+00
   2.86181760e-01   1.40000000e-01   1.00000000e+00   1.00000000e+00
   1.00000000e+00   1.00000000e+00   1.00000000e+00   1.00000000e+00
   1.00000000e+00   1.00000000e+00   1.00000000e+00   1.00000000e+00
   1.40000000e-01   1.00000000e+00   2.17944947e-01   1.95959179e-01
   1.70587221e-01   4.58257569e-01   1.70587221e-01   1.00000000e+00
   4.99099189e-01   2.02486808e+08   1.41118425e+08   4.48998886e-01
   1.00000000e+00   1.51244835e+00   4.87749936e-01   1.06320271e+00
   4.20832508e-01   7.90387082e+02   3.57071421e-01   6.93974063e-01
   7.95928719e+02   4.37011441e+00   1.00000000e+00   2.05277860e+00
   1.00000000e+00   1.40000000e-01   9.94987437e-02   1.37770671e+02
   1.39064733e+00   4.44642553e-01   4.66643451e-01   3.67623326e-01
   3.47209029e-01   3.67762151e-01   3.56016854e-01   4.20832508e-01
   9.72496355e+01   9.00000000e+00   4.53659465e-01   2.95585030e-01
   4.67837408e-01   3.63294646e-01   4.28759548e-01   3.88395095e-01
   2.88746446e-01   3.14759766e-01]
In [30]:
def PlotFigureDataWithoutCenter(data, rate, clusters):
    sampledData = data.sample(False, rate)
    #print(sampledData.count())

    clusterCentroids = sc.parallelize(clusters.centers)
    #sampledDataCentroids = sampledData + clusterCentroids
    sampledDataCentroids = sampledData
    arraysampledDataCentroids = np.array(sampledDataCentroids.take(sampledDataCentroids.count()))
    print(arraysampledDataCentroids.shape)
    
    Y_labels = sampledDataCentroids.map(lambda x: clusters.predict(x))
    Y_labels = np.array(Y_labels.take(Y_labels.count()))
    
    pca = PCA(n_components=3)
    score = pca.fit_transform(arraysampledDataCentroids)
    PCA(copy=True, n_components=3, whiten=False)

    #plot the clusters in color
    fig = plt.figure(1, figsize=(8, 8))
    plt.clf()
    ax = Axes3D(fig, rect=[0, 0, 1, 1], elev=8, azim=200)
    plt.cla()
    ax.scatter(score[:,0],score[:,1], score[:,2], c=Y_labels)

    ax.w_xaxis.set_ticklabels([])
    ax.w_yaxis.set_ticklabels([])
    ax.w_zaxis.set_ticklabels([])
    ax.set_xlabel('X ')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')

    plt.show()

Plot top 100 anomalous data points only

In [41]:
PlotFigureDataWithoutCenter(normalizedData2, 1, clusters)
(100, 122)
/opt/conda/lib/python3.5/site-packages/matplotlib/collections.py:590: FutureWarning:

elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison

Plot top 100 anomalous data points and 160 centers of the model.

In [43]:
PlotData(normalizedData2, 1, clusters)
(178, 122)
/opt/conda/lib/python3.5/site-packages/matplotlib/collections.py:590: FutureWarning:

elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison

We can see they when we add centers to the plot, top-100 anomalous data points are quite far away from other data points. So that, instead of seeing separated points like the first figure, we only see points focuses to form a cluster and become a small set in the plot, far away from other centers.

Question 13

Try other methods to find the best value for $k$ such as `silhouette`, `entropy`... In particular, with this data, you can take advantage of predefined labels to calculate the quality of model using entropy... However, we suggest you to try with `silhouette`. It's more general and can work with any dataset (with and without predefined labels).

Here are some additional information about the metrics we suggest to use:

Note
you are free to play with any relevant evaluation metric you think appropriate for your work!

In [31]:
def calEntropy(listValue):
    ent = 0
    n = len(listValue)
    #print(n)
    for x in set(listValue):
        ent -= listValue.count(x)/n *  np.log(listValue.count(x)/n);
    return ent*n
In [32]:
# k: the number of clusters
def clusteringScoreEntropy(labels, data, k):
    print(k)
    clusters = KMeans.train(data, k, maxIterations=10, runs=10, initializationMode="random")
    # calculate mean square error
    closest_centerss = data.map(lambda x: clusters.predict(x))
    closest_center_label = closest_centerss.zip(labels)
    entScore = closest_center_label.groupByKey().mapValues(list).map(lambda x: calEntropy(x[1])).sum()/closest_center_label.count()
    print(entScore)
    return entScore
In [47]:
kList = range(80, 161, 10)
entScores = list(map(lambda xk: (xk , clusteringScoreEntropy(labels_data, normalizedData, xk)), kList))
80
/opt/spark/python/pyspark/mllib/clustering.py:347: UserWarning:

The param `runs` has no effect since Spark 2.0.0.

0.0220152627366
90
0.0257946195467
100
0.0230592355487
110
0.0249077107456
120
0.0179280472809
130
0.0165599799982
140
0.0183107227647
150
0.0146660986153
160
0.0181881936621
In [21]:
Scores = [[80,0.0220152627366],[90,0.0257946195467],[100,0.0230592355487],[110,0.0249077107456],[120,0.0179280472809],[130,0.0165599799982]\
          ,[140,0.0183107227647],[150,0.0146660986153],[160,0.0181881936621]]
In [22]:
npScores = np.array(Scores)
trace0 = go.Bar(
    x=npScores[:,0],
    y=npScores[:,1],
    marker=dict(
        color='rgb(158,202,225)',
        line=dict(
            color='rgb(8,48,107)',
            width=1.5,
        )
    ),
    opacity=0.6
)

data4 = [trace0]
layout = go.Layout(
    title='Figure 4: Calculation of Entropy for each k',
    xaxis=dict(
        title='Number of K'
    ),
    yaxis=dict(
        title='Entropy'
    )
)

fig = go.Figure(data=data4, layout=layout)
iplot(fig, filename='text-hover-bar')

When we use Entropy to score the performance of our clustering, a good clustering would have clusters whose collections of labels are homogeneous and so have low entropy. We average each cluster entropy weighted by its cluster size.

As more clusters are added, it should always be possible to make clusters have its collections of labels more homogeneous. In fact, if k is chosen to equal the number of data points, the average entropy will be 0, because every point will be its own cluster of one.

We also try to use Silhouette Score to performance our cluster.

In [33]:
from sklearn.metrics import silhouette_samples, silhouette_score

# k: the number of clusters
def clusteringScoreSilhouett (data, k):
    print(k)
    clusters = KMeans.train(data, k, maxIterations=10, runs=10, initializationMode="random")
    # calculate mean square error
    closest_centerss = data.map(lambda x: clusters.predict(x))
    
    convertedData = np.array(data.take(data.count()))
    convertedlabel = np.array(closest_centerss.take(closest_centerss.count()))

    silScore = silhouette_score(convertedData, convertedlabel)
    
    print(silScore)
    return silScore
In [34]:
kList = range(80, 161, 10)
sampledNormalizedData = normalizedData.sample(False, 0.001)
silScore = list(map(lambda xk: (xk , clusteringScoreSilhouett(sampledNormalizedData, xk)), kList))
80
/opt/spark/python/pyspark/mllib/clustering.py:347: UserWarning:

The param `runs` has no effect since Spark 2.0.0.

/opt/conda/lib/python3.5/site-packages/numpy/core/_methods.py:59: RuntimeWarning:

Mean of empty slice.

0.734957043125
90
0.678320649238
100
0.679236365658
110
0.685957339385
120
0.702924835775
130
0.699519765653
140
0.733915148418
150
0.696458797617
160
0.729932417629
In [23]:
silScore = [[80,0.734957043125],[90,0.678320649238],[100,0.679236365658],[110,0.685957339385],[120,0.702924835775] \
            ,[130,0.699519765653],[140,0.733915148418],[150,0.696458797617],[160,0.729932417629]]
In [24]:
npScores = np.array(silScore)
trace0 = go.Bar(
    x=npScores[:,0],
    y=npScores[:,1],
    marker=dict(
        color='rgb(158,202,225)',
        line=dict(
            color='rgb(8,48,107)',
            width=1.5,
        )
    ),
    opacity=0.6
)

data5 = [trace0]
layout = go.Layout(
    title='Figure 5: Silhouette Score for each k',
    xaxis=dict(
        title='Number of K'
    ),
    yaxis=dict(
        title='Silhouette Score'
    )
)

fig = go.Figure(data=data5, layout=layout)
iplot(fig, filename='text-hover-bar')

Because Silhouette scores need many much computation (distance from each point to its closest center and other center also). So we just compute 0.1% samples of our data set to calculate Silhouette scores.

The silhouette ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.

It seems that k = 80 is the best k, which is followed by the silhouette characteristics. (The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.). But remember that this is just a result on 0.01% samples on over dataset. In our opinion, we prefer the number of k is 140, because in the previous experiments demonstrate that the k-value should be larger than 110 intuitively.

Question 14

Implement K-means on Spark so that It can work with large datasets in parallel. Test your algorithm with our dataset in this notebook. Compare our algorithm with the algorithm from MLLIB.
    Let's clarify the meaning of this question: what we want is for students to design the K-means algorithm for the parallel programming model exposed by Spark. You are strongly invited to use the Python API (pyspark). So, at the end of the day, you will operate on RDDs, and implement a `map/reduce` algorithm that performs the two phases of the standard K-means algorithm, i.e. the assignment step and the update step.